MMS • Tapabrata Pal
Article originally posted on InfoQ. Visit InfoQ
Transcript
Pal: I’m Topo Pal. Currently, I’m with Fidelity Investments. I’m one of the vice presidents in enterprise architecture. I’m responsible for DevOps domain architecture and open source program office. Essentially, I’m a developer, open source contributor, and a DevOps enthusiast. Recently, I authored a book called, “Investments Unlimited.” It is a story about a bank undergoing DevOps transformation, and suddenly hit by a series of findings by the regulators. The book is published by IT Revolution and available on all the major bookselling sites and stores, so go check it out.
Interesting Numbers
Let’s look at some interesting numbers. By 2024, 63% of world’s running software will be open source software. In AI/ML, specifically, more than 70% software is open source. Top four OSS ecosystems have seen 73% increase in download. Python library downloads have increased by 90%. A typical microservice written in Java has more than 98% open source code. If you look at the numbers, they’re quite fascinating. They’re encouraging, they’re interesting, and scary, all at the same time. Interesting is just the numbers, 63%, 70%, these are all just really interesting numbers. Encouraging is big numbers, just seeing where open source has come to. The scary is, again, those big numbers from an enterprise risk standpoint, which basically means that most of the software that any enterprise operates and manages are out of their hand, out of their control. They’re actually written by people outside of their enterprises. However, about 90% of IT leaders see open source as more secure. There’s another argument against this, the bystander effect. Essentially, everyone thinks that someone else is making sure that it is secure. The perception is that it is more secure, but if everybody thinks like that, then it may not be the reality, in whatever it is, just the fact that 90% leaders trusting open source is just huge. Until this happens, everything goes in smoke when a little thing called Log4j brings down the whole modern digital infrastructure.
Log4j Timeline
Then, the next thing to look at is the timeline of Log4Shell. I took this screenshot from Aquasec’s blog. This is one of the nicest timeline representations of Log4j. This is just for reference. Nothing was officially public until the 9th of December 2021. Then, over the next three weeks, there are three new releases and correspondingly new CVEs against those, and more. It was a complete mess. It continued to the end of December, and then bled into next year, January, which is this year. Then February, and maybe it’s still continuing. We’ll see.
Stories About Log4j Response
Let’s start with the response patterns. I’ll call them stories. As a part of writing this paper, we interviewed or surveyed a handful number of enterprises, across financial, retail, and healthcare companies of various sizes. These companies will remain anonymous. The survey is not scientific by any means. Our goal was not to create a scientific survey, we just wanted to get a feel of what the industry and the people in the industry experienced during this fiasco, how they’re impacted. I think most companies will fit into one of these stories. These stories are not like individual enterprises that we picked from our sample. This is kind of, collected the whole stories and grouped them in three different buckets, and that’s how we are presenting. Even though it may sound like I’m talking about one enterprise, in each of these stories, but they’re not. I also thought about the name of the story, the good, the bad, and the ugly. I don’t think that that’s the right way to put it. We’ll do number one, number two, and number three, as you will see that there is nothing bad or ugly about it. It’s just about these enterprises’ maturity. The stories are important here, because you’ll see why things are like what they are.
The question is, what did we actually look at when we built the stories and created these response patterns? These are the things that we looked at. Number one, the detection, or when someone in a particular enterprise found out about this, and then, what did they do? Number two is declaration. At what point there is a formal declaration across the enterprise saying that there is a problem. Typically, this is from a cyber group or information security department, or from the CISO’s desk. Next is the impact analysis and mitigation. How did these enterprises actually go for impact analysis and mitigation? Mitigation as in stop the bleed. Then the remediation cycles. We call it cycles intentionally, because there was not just one single remediation, there are multiple of those there, because as I said, multiple versions came along in a period of roughly three weeks. Then the cycles are also some things, some cycle that are generated by the enterprise risk management processes that go after this set of applications first, then the next set, then the third set. We’ll talk about those. The last one is the people. How did the people react? What did they feel? How did they collaborate? How did they communicate with each other? Overall, how did they behave during this period of chaos? We tried to study that. Based on these, we formed these different patterns, the three patterns, and I’m calling that, three stories.
Detection
With that, let’s look at the first news of Log4j vulnerability, the power of social media. It came out at around 6:48 p.m., December 9, 2021. I still remember that time because I was actually reading that when it popped up on my Twitter feed. I actually called our InfoSec people. That’s the start of Log4j timeline. With that, let’s go to the detection. On 2021 December 9th, let’s talk about the detection. Note that, InfoSec was not the first one to know about it. Maybe they knew about it, but they’re not the first one. Right from the get-go, there’s a clear difference between the three with regards to just handling the tweet. Number one, cyber engineering, some engineers in the cyber security department noticed the tweet and alerted their SCA, or software composition analysis platform team immediately. Then other engineers noticed that too in their internal Slack channel. SCA platform team sends inquiry email right then and there to the SCA vendor in that evening itself.
Number two, similarly, some engineers noticed that tweet and sends the Twitter link to the team lead. Then the team lead notifies the information security officer, and it stops there. We don’t know what happened after that, in most of the cases. Number three, on the other hand, there’s no activities to report, nobody knows if anybody has seen that, or if they did anything. Nothing to track. Notice the difference between number one and number two. Both of them are reacting, their engineers are not waiting for direction. We don’t know what the ISOs did after number two, actually. The team lead in number two enterprise notified the ISO, and we don’t know what the ISO did after that. Let me now try to represent this progress in terms of some status so that we can visualize it and understand what’s going on here. In this case, number one is on green status on zero day. Number two is also green. Number three is kind of red because we don’t know what happened. Nobody knows.
Declaration
With that, let’s go to the next stage, which is declaration, or in other words, when InfoSec or cyber called it a fire, that we have a problem. On December 10th, morning, the SCA vendor confirms zero day. Engineers at the same time almost start seeing alerts in their GitHub repository. Then the cyber team declares zero day, informs the senior leadership team and the mass emails go out. This is all in the morning. I think they are on green on day one, they reacted in a swift manner that put them in the green status. Number two, ISO notifies the ISO who was informed about this, on the previous day. He notifies the internal red team. The red team discusses with some third-party vendors, and they confirm that it is a zero day. Then CISO is informed by midday. Then, senior leadership meeting during the late afternoon, they discuss the action plan. They’re on yellow because, first of all, that’s late in the day that they’re actually discussing about the senior leadership team, and about action plan. They should have done it early in the morning, I think, as number one. Number three, on the other hand, the red team receives daily feed in the morning about their security daily briefing. Then Enterprise Risk Office is notified before noon. Then the risk team actually met with the cyber and various “DevOps” teams. The first thing they do is enter the risk in their risk management system. That’s why they’re yellow.
Impact Analysis and Mitigation
On the impact analysis and mitigation process, this is about, first try to stop the bleed. Second, where the impact is, which application, which platform, which source code? Then, measure the blast radius and find out what the fix should be, and then go about it. Number one, before noon on December 10th itself, right after they send out that mass email, they actually implemented the new firewall rule to stop the hackers coming in. By noon, the impacted repositories were identified by the SCA platform team, so they have a full list of repositories that needed to be looked at or at least impacted from this Log4j usage. By afternoon, external facing applications start deploying the fix via CI/CD. Note that these are all applications that are developed internally. These are not vendor applications, not vendor supplied or vendor hosted applications. Overall, I think number one has done a good job. They responded really quickly. On the same day, they started deploying the changes to external facing applications. Number two, on the other hand, they implemented the firewall rule at night. You may ask, why did they do that, at night or wait until the night? I do not know that answer. Probably, they’re not used to changing production during the day, or maybe they needed more time to test it out.
Number two, however, struggled to create a full impact list. They had to implement custom scripts to detect all Log4j library usage in their source control repositories. I think they are a little bit behind, because they actually took a couple extra days to create the full impact list. They really did not have a good one to start with. All those custom scripts were created. Number three, on the other hand, they went to their CMDB system to create a list of Java applications from their CMDB. There are two assumptions to it. One is that they assume that all Java applications are impacted, whether or not they used that particular version of Log4j. Number two is they assumed their CMDB system is accurate. The question is, how did the CMDB system know which Java applications there are. It’s, essentially, developers are developing things manually, identify their CMDB entries to have Java programming languages in those. Based on that CMDB entry list, email is sent out to all the application owners, and then project management structure was established to track the progress. Nothing is done yet. Nothing is changed. It’s just about how to manage the project and getting a list. I think they are great, on the fourth day.
Remediation Cycle
Let’s look at the start of the remediation cycle. Of course, there are multiple rounds of remediation. One thing I’ll call out is that in all the enterprises that we spoke to, they all had the notion of internal facing application versus external facing application. Any customer facing application is external facing application, the rest are internal facing application. Every enterprise seemed to have a problem in identifying these applications and actually mapping them with a code repository or a set of code repositories, or an artifact in their artifact repository. It is more of a problem if you use some shared library or shared code base all across your applications across the whole enterprise. Then, we also notice that all enterprises have a vulnerability tracking and reporting mechanism that is fully based on these application IDs in their IT Service Management System, where, actually, the applications are marked as internal versus external, they have to fully rely on that. We may or may not agree with this approach, but this is a common approach across all the big enterprises that we talked to. I think it’s a common struggle.
Back to the comparison. Number one, during December 11th and 12th, they fixed all their external facing applications and then they started fixing their internal facing applications on December 13th. I think they’re a little bit yellow on that side, and I think they struggled a bit in finding out these two sets of applications, internal versus external, and spent some cycles there. Number two, on the other hand, they actually reacted to the new vulnerability on 13th, now they are in the middle of fixing for the first one, the second one came in, and then confusion. On December 14th, they opened virtual war rooms all on Zoom, because people are still remote at that point, 50% of external facing applications fixed their code, and some of them are in production, and most of them are not. Number three, on the other hand, they are manually fixing and semi-manually deploying over the period of December 14th and 16th. They do have incorrect impact analysis data, so basically, they’re fixing whatever they can and whatever they know about. Project is of course in red status, and we don’t disable that, so we put a red status at this remediation cycle.
Remediation cycle continues. Number one, through the period of 13th to 19th, they even fixed two new vulnerabilities, two new versions, Log4j 2.15 and then 2.16. That’s two more cycles of fixes all automated, and then all external facing applications are fixed. Number two, the war room continues through the whole month of December, external facing applications fixed multiple times via semi-automated pipeline. Internal facing applications must be fixed by end of January 2022. They did actually determine that with the firewall mitigation and external facing applications being fixed, they don’t have that much risk with internal facing. They took that towards that, let’s give one month to the developers to fix all the internal facing applications. Number three, some applications are actually shut down due to fear. This is quite hard to believe, but this is true. That they got so scared that they actually shut down some customer facing applications so that no hacker from outside can come in. At that time, they basically did not have a clear path in sight to fix everything that were really impacted. At this point, I’d say number one is green, and number two is kind of, on the 20th day, they’re still red, because they’re not done yet. Number three, who knows? They’re red anyway.
Aftershock Continues
Aftershock continues. Vendors start sending their fixes starting January 2022 for number one, but I think overall, within 40 days they fixed everything. I think they’re green. For number two, it took double the time, but I think they reached, so that’s why they’re green. They at least finished it off. Number three, most known applications are fixed. The unknowns are not of course fixed. As I said, at least number one and number two, they claimed to have fixed everything. The interesting part, maybe it’s because of the number three, is that from Maven Central standpoint, Log4j 2.14 and 2.15 are still getting downloaded as I speak. Which enterprises are downloading? Why? Nobody knows, but they are, which basically tells me that not all Java applications that were impacted by Log4j 2.14 actually got remediated, so they’re still there. Maybe they have custom configurations to take care of that, or they’re not fixed. I don’t know. I couldn’t tell. At this point, if you look at the status, as I said, it’s number one, green, number two, green, and number three is red.
Effect on People
Let’s talk about the effect on people. Most of us, the Java developers, and people around the Java applications, they want to forget these few weeks and months. Actually, it was the winter vacation that got totally messed up. Everyone saw some level of frustration, burnout, and had to change their winter vacation plan. Number one, the frustration was mostly around these multiple cycles of fixes, you fixed it and you thought you’re done, and you’re packing up for your vacation. Then, your boss called you and said that, no, you need to delay a little bit. Overall, few had to delay their vacation for a few days. Overall, no impact on any planned vacation. Everybody who had planned for vacation actually went for vacation happily. Number two, people got burnt out. Most Java developers had to cancel vacation. Many other canceled their vacations around the Java development teams like management, scrum masters, and release engineers, and so on. The other impact of this was that for most of these enterprises, January and February actually became their vacation months, which basically means that the full year, 2022’s feature delivery planning got totally messed up because of that. Number three, people really got burnt out. Then, their leadership basically said that, ok, this is the time where we should consider DevSecOps, as a formal kickoff of their DevOps transformation. We don’t know, we did not keep track as to what happened after that.
Learnings – The Good
That’s the story. We wrapped up the story, or the response patterns. Let’s go through the learnings. The learnings have two sides to it. One is, let’s understand first, the key characteristics of these three groups that we just talked about. Then, from those characteristics, we will try to form the key takeaways. Let’s look at the characteristics for the good one, or the number one. They had software composition analysis tool integrated with all of their code repository, so every pull request got scanned. If any vulnerabilities were detected in that repository, they got feedback through their GitHub Issues. They had complete software bill of material for all repositories. I’m not talking about SPDX or CycloneDX form of SBOM, these are just pure simple dependency lists, and they have metadata for every repository. They also had fully automated CI/CD pipeline. As you noticed here, that unless a team had a fully automated CI/CD pipeline, manually fixing and deploying three changes in a row within two weeks is not easy all across the enterprise.
They had mature release engineering and change management process that are integrated with their CI/CD pipeline that helped them actually to carry out this massive amount of changes within a short timeframe, across the whole enterprise. They also had good engineering culture collaboration, as you have seen during the declaration and detection phases. The developers were empowered to make decisions by themselves. As I mentioned, during the declaration phase, the developers knew that it is coming and it is bad, and it’s going to be bad. They did not actually wait for the mass email that came out from the CISO’s desk. They actually were trying to fix right from the get-go. They noticed it first and then they took it upon themselves to fix that thing. They did not wait for anyone. That’s called developer’s empowerment. Also, the culture that the developers can make a decision by themselves, whether to fix it or not, and actually not wait for InfoSec. Then, of course, this resulted in good DevOps practices, good automation all the way through. Most of these applications in these enterprises, they resided on cloud, so no issue there.
Learnings – The Not-So-Good
On the other hand, the not-so-good, or number two. They had SCA coverage, but limited. It was not shifted left. Most of these enterprises for most of these applications, they scanned right before the release, which was too late. They are on the right side of the process, not on the left side. They did not have the full insight into the software bill of materials or simple dependency tree. They had some. They had semi-automated CI/CD pipeline, which basically means complete CI automation, but not so much on the CD side or the deployment side. The did have good change management process. What I’m saying is that these enterprises are actually very mature enterprises. It’s just that they got hit with Log4j in a bad way. They do have good team collaboration. The cyber engineers saw that, then immediately the engineer passed it on to the team’s lead who passed it on to the ISO. That’s a good communication mechanism and team collaboration. They did have a bit of hero culture, since they did not have full SCA coverage, basically, some of their developers stood up and said that we are going to write a script, and they’re going to scrape the whole heck out of their internal GitHub and report out on all the repositories that had a notion of Log4j in their dependency file, or pom.xml, or Gradle file. With that, they struggled quite a bit.
Learnings – The Bad
Let’s talk about number three, they did not have anything. No SCA, no tooling. As I said, classic CMDB driven risk management. It’s basically, look at your CMDB, look at all the Java applications and just inform the owner, and it’s their responsibility. We’re just going to track from a risk management perspective. They did have DevOps teams, but they were not so effective. They’re just called DevOps teams, probably, they’re just renamed Ops team and not so much DevOps. Of course, as you can tell, that they have a weak engineering culture and silos. Basically, even if somebody saw something, they would not call cyber, because they don’t know that they’re supposed to do that, or they’re empowered to do that, or anybody would entertain anybody’s communication that way. We don’t know, so let’s not make any judgment call. I think, overall, what we found is that they had weak engineering culture across the enterprise.
Key Takeaways
The key takeaways from these. I would like to take a step back and try to answer these four questions. Number one, what are the different types of open source software? How do you bring them in, in your enterprise? Then, once we bring them in, how should we keep track of their usages? Then the last question is, what will we do when the next Log4Shell happens? Are we going to see the same struggle or things are going to improve? Let’s go through one by one. The first thing, open source come in many forms. Some are standalone executables. Some are development tools, utilities. Some are dependencies, that’s the Log4j’s. These are the things that get downloaded from artifact repositories and gets packaged with the actual application. Even source code that gets downloaded from github.com, and then they get into the internal source code repositories or source code management system. They come in different formats too, like jar, tar, zip, Docker image. These are the different forms of open source, Log4j is just the third bullet point, they fall into the dependencies. Most of the time, we don’t look at the other types. They also come in via many paths. First, developers could directly download from public sites. Some enterprises have restrictions around that, but that creates a lot of friction between the proxy team and the developer or the engineering team. Some enterprises chose not to stop that, which basically means developers can download any open source from public sites.
Pipelines can download build dependencies from their continuous integration pipeline. Most of these build dependencies, in all the enterprises that we saw, in number one and number two group, they actually downloaded from an internal artifact repository like Artifactory, or Nexus. For standalone software tools and utilities, many enterprises have an internal software download site. Then production systems sometimes directly download from public sites, which is pretty risky, but it happens. Then, open source code gets into internal source code. We don’t usually pay attention to all of these paths, we just think about the CI pipeline downloading Log4j and all that, but the other risks are already there. All of these are known, if we narrow it down, we could write down more actually. Then, what are the risks involved with that? We need to do the risk mitigation or risk analysis on that anyway.
To track how they’re used or if you want to manage this usage, you firstly need to know what to manage. What to manage comes from what we have. First thing is for all applications, whether they are internally developed applications, or vendor software, or standalone, another open source software, we need to gather all of them and their bill of materials, and store them in a single SBOM repository. This could be, as I said, internally developed application, runtime components, middleware, databases, tools, utilities, anything that is used in the whole enterprise. Then we need to continuously monitor that software bill of materials, because without the continuous monitoring of the software bill of materials, it’s hard to tell whether there is a problem or not, or when there is a problem. With continuous SBOM monitoring, we need to also establish a process to assess the impact. If that continuous monitoring detects something and creates an alert, what is going to happen across the enterprise? What is the process of analyzing that impact? Who is going to be in charge of that? All this process needs to be documented. Then, establish process to declare vulnerability. At which point, who is going to declare that we have a big problem and what is the scope of the problem? Who is going to get notified and how? All this needs to be written down, so that we do not have to manufacture these, every time there is a Log4Shell. Without these, we will run into communication problems, communication breakdown. It will unnecessarily cause a lot of pain or friction, to a lot of folks, that we do not want.
In terms of mitigation and remediation, we need to decide whether we can mitigate only, or we need to remediate or fix the root cause as well. In some cases, just mitigation is good enough. Next, do we need to manually remediate or we need to automate that? It depends upon what it is, how big it is, and whether we have the other things to support us in this automated mitigation process. Prioritization is another big thing, because as I said, external versus internal, or what kind of open source it is, the prioritization will depend upon that, and its risk profile. That prioritization itself can be a complex one and be time consuming one too. Then, deploying the remediation, automation is the only way to go.
The last thing, probably the most important tool that we have is the organizational culture. I’m convinced that the Log4Shell response within an enterprise talks a lot about the enterprise’s DevOps and DevSecOps technical maturity and culture, the engineering culture, leadership style, organizational structure, how people react and collaborate, and their overall business mindset. Do they have blameless culture within the organization? That is also something that comes out, out of these kinds of events. We often use the term shift left security. We most likely mean that having some tools in our delivery pipeline or code will solve the problem, but it is more than that. We need to make sure we empower the developers to make the right decisions while interacting with these tools. Each of these tools will generate more work for the developers. How do we want the developers to prioritize these security issues over feature delivery? Can they design? Are they empowered to design themselves? Lastly, practicing chaos. Just like all other chaos testing, we need to start thinking about practicing CVE testing, chaos testing. Practice makes us better. Otherwise, every time there is a Log4Shell, we will suffer. Spend months chasing this, and we will have demotivated associates, and that will not be good. Because the question is, not if, but the question is, when will there be another Log4Shell?
Questions and Answers
Tucker: Were there any questions in your survey that asked about the complete list of indicators of compromise analysis?
Pal: Yes, partially, but not a whole lot. We just wanted to know the reaction more than the action. We did ask about, how do you make sure that everything is covered? We did ask about whether you completely mitigated, or remediated, or both. That was not our intent of the questions.
Tucker: As a result of all the research that you’ve done, what would you prioritize first to prepare for future vulnerabilities, and why?
Pal: First, we have to know what we have. The problem is that we don’t know what we have. That is because, first of all, we don’t have the tools. Neither do we have a collective process of identifying the risks where we need to focus on. There are many forms of open source that we don’t have control over. There are many ways that we can get them in our environment. We only focus on the CI/CD part, and that too in a limited way. We have a lot more way to go. The first thing is, know what you have first before you start anywhere else.
Tucker: Are you aware of any good open source tools or products for helping with that, managing your inventory and doing that impact analysis?
Pal: There are some good commercial tools. In fact, on GitHub itself, you have Dependabot easily available and few other tools. Yes, start using that. I think we do have some good tools available right away, you should just pick one and go.
Tucker: Once you know what you have that you need to deploy, and you’re able to identify it, I know that you talked about automation for delivery and enabling automated deployment release. What technical foundations do you think are really important to enable that?
Pal: In general, to enable CD, continuous delivery in a nice fashion, I think the foundation block is the architectural thing. Because if it is a monolithic architecture, and we all know that if it is a big monolithic application, yes, we can get to some automated build and automated deployment, but not so much on the continuous integration in the classic sense, and in the continuous delivery in the classic sense. That can be automation, but that’s about it. To achieve full CI and CD, and I’m talking about continuous delivery, not continuous deployment, we actually need to pay attention to what our application architecture looks like, because some architecture will allow you to do that, some architecture will not allow you to do that, period.
Tucker: What are some other patterns that you would recommend to allow you to do that?
Pal: For good CI/CD, decoupling is number one, so that I can decouple my application with respect to anything else, so that I can independently deploy my changes, and I don’t have to depend upon anything else, or anybody else around me. Number two is, as soon as I do that, then the size of my thing comes into picture. The ideal case is, of course, the minimal size, which is the microservice, I’m going towards that term without using that term. Essentially, keep your sizes low, be lean and agile, and deliver as fast as possible without impacting anybody else.
Tucker: One of the things that I love that you brought up in your talk was the idea of practicing chaos. In this context, what ideas do you have for how that would look?
Pal: If we had the tools developed, and we have an inventory of things that we have in terms of open source, just like chaos, let’s find out how many different ways we can actually generate a problem, such as declare a dummy CVE as a zero-day vulnerability, and score 10, and see what happens in the enterprise. Those kinds of things, so that you can formulate few scenarios like that, and see what happens. You can do that in multiple levels. You can do it at the level where it’s just the developer tools on somebody’s laptop, versus a library like Log4j that is used everywhere, almost. There are various levels that you can do that.
Tucker: I love that idea, because I think with these things that we don’t practice very often that could give us a muscle to improve and measure our response over time.
See more presentations with transcripts