Article originally posted on InfoQ.
Transcript
Smolen: My name is Alex Smolen. I’m the director of security for LaunchDarkly. I’m here to talk about how our security team solved the problem, and by doing so, achieved perfect mental clarity, or at least a temporary reduction in stress. Either way, we think what we learned is worth knowing.
LaunchDarkly
First, I want to talk about where I work and what I work on so that I can put our security problems into context. I work at LaunchDarkly. LaunchDarkly is a service you can use to build feature management into your software. You can deploy code and then turn it on and off with a switch. Why would you want to have these kill switches in your software? I think it’s pretty cool. As a matter of fact, I don’t call it a kill switch, I call it a chill switch. Let’s say you have an outage, or a security incident caused by a bad piece of code. Rather than scrambling to deploy a fix, you can flip a switch. When you’re triaging an incident, the realization that you can end it with a simple flip, it’s pretty powerful. LaunchDarkly’s vision is to create a world in which software releases are safe and unceremonious. That means helping software developers around the world be more chill. The LaunchDarkly security team’s vision is to help our customers’ security teams chill out. We need to solve security problems and show our work so that they can trust us and use our service knowing that our security standards are as high as theirs.
Vulnerability Scanners
Seems easy enough. We actually had a security problem that was decidedly unchill: vulnerability scanners. You know the story. You've run these things before. Let me give you a list of the worst things about vulnerability scanners. The first is that you have a lot of vulnerability reports that you have to triage and deal with. The second is that there are a lot of vulnerability reports that come out of them, and it keeps on going from there. The real worst thing about vulnerability scanners is that there are so many of them. You've got network scanners, OS scanners, web app scanners, DB scanners, code scanners, container scanners, cloud scanners. At LaunchDarkly, we're a small but mighty security team. We know that if we tried to triage every result from all of these scanners, we'd be overwhelmed. We could turn them all on, but pretty soon, we'd have to tune them all out. We're also in the process of undergoing a FedRAMP certification, which has strict vulnerability management requirements. It's a tough standard: at the moderate baseline, there are a mere 325 controls that we get audited against. That includes RA-5, which describes how we must perform vulnerability scanning at each of our different layers, and then mitigate or otherwise document our acceptance of vulnerabilities within a defined set of SLOs. How do we deal with the huge volume of scanner results that we have and the required evidence documentation, while keeping our chill level to the max?
Inbox Zero
Security can be a stressful occupation. Our minds are constantly bombarded with problems and potential problems. Our team felt the pain of this chaos. That's when we realized we weren't alone, and we could build a shelter from the storm. Imagine the clarity of knowing that all of your systems are scanned, and all of the results are accounted for. Your mind would be empty, free. We searched for inspiration on how to achieve this highest of mental states, and that's when we found Inbox Zero. Inbox Zero is not a new concept. It's about email, that thing that we used to send back in the early millennium. Inbox Zero is the practice of processing every email you receive until your inbox is empty, with zero messages. Why would you do this? Because attention and energy are finite. Spending your time responding to messages makes you reactive. When you have a zero inbox, you can actually focus on something, because you know there's nothing more urgent sitting and waiting for you and your dopamine-fueled need for distraction and novelty. The principles of Inbox Zero are intended to free us from the tyranny of incoming information, and let us return to our lives.
Inbox Zero as a concept was developed in 2007 by Merlin Mann in a series of blog posts and a Google Tech Talk. He was overwhelmed by the torrent of email that he was receiving, so he described how to build walls around your inbox. This means processing your email and deciding what it means to you with a series of actions that he described, the first being delete, or archive. You can go back and refer to it later, but get it out of your inbox if it's not important. Second, you could delegate, or send the email to somebody else. You could respond, but this should be quick, less than two minutes, or else you'll forget about the rest of your inbox. You can defer, or use the snooze feature to remind you at a more opportune time. Or finally, you could do something, or capture a placeholder to do something about it. These actions are in priority order, and the most important step for Inbox Zero is to delete email before it gets to your inbox. He described the importance of aggressive filters to keep the junk out and your attention free. Focus on creating filters for any noisy, frequent, and non-urgent items. The most important part of your Inbox Zero process should be automatically deciding whether a given message can be deleted, and doing so. Sometimes it can be tough to figure out whether you can delete an email or not; just remember, every email you read, reread, and re-reread as it sits in that big dumb pile is actually incurring mental debt on your behalf. Delete; think in shovels, not in teaspoons. Requiring an action for each email and focusing on finding the fastest and straightest path from discovery to completion is what helps us keep our inboxes clear.
What does this mean for us security professionals? Our time and attention are finite, but the demands on our time and attention are infinite. When he came up with Inbox Zero 15 years ago, Mann knew that it was about more than just email. In his blog post he name-checked Bruce Schneier, who once famously said, security is a process, not a product, and the same is true of the zero inbox. He also wrote in his blog post that, like digital security and sustainable human love, smart email filtering is a process. Inbox Zero is about email, but it may even be useful for digital security. Then there's these claims about sustainable human love: where did he get these revolutionary ideas? It turns out that the biggest secret to Inbox Zero is that it's based heavily on David Allen's "Getting Things Done" book. In this book, David Allen described an action-based system for processing any information or material that lives in an inbox so that you can clear it out. There are three questions he described that you need to answer. First, what does this message mean to me, and why do I care? Second, what action does this message require of me? Finally, what is the most elegant way to close out this message and the nested action that it contains? If you're familiar with it, though, Getting Things Done is more than just an organizational system.
This is directly from a 2007 Wired article about the proclaimed power of GTD. It's called, Getting Things Done Guru David Allen and His Cult of Hyperefficiency. Within his advice about how to label a file folder, or how many minutes to allot to an incoming email, there is a spiritual promise. Later, it says, there is a state of blessed calm available to those who have taken careful measure of their habits, and made all changes suggested by reason. Maybe personal productivity can get a little culty, but I'm not trying to be like that. I am relatively sure, though, that if you want to manage your vulnerability scans and be as chill as our team, you need to Inbox Zero your reports. I'm here to show you how we did that.
It was about a year ago that we started working together on this problem. We knew we needed to scan all of our resources and respond to the results of these scans within a defined timeframe. We also knew that responding to every scanner result is the recipe for a bad time. There's no central inbox, and there's no way to determine if it's zero. Just a bunch of Slack messages, emails, CSV files, Excel spreadsheets, blood, sweat, and tears. We needed a single source of truth that was only filled with items that merited our time and attention. This meant we needed processing so that we could get rid of the inevitable out-of-date code dependencies that weren't hit, old container images that needed to be cleaned up, and out-of-date versions of curl. These things are like the spam and forwarded emails from your in-laws. The smarter our processing, the happier our responders.
Processing
This processing step was crucial. We wanted to spend our time, attention, and energy on the few findings that actually mattered. Our team got together and brainstormed what this vulnerability processing pipeline should look like. First, we knew that we wanted to automatically suppress all scan results that were known noisy items, and ignore them. Next, we would check if any of the scan results were already ticketed. Maybe we already knew about it, but hadn't fixed it yet. If so, we could ignore them. Next, we need someone to come in and triage the result. Is it a false positive? If so, don't just ignore it once. Write a suppression so that the next time we see it, it's automatically ignored. Next, you would say, ok, this isn't a false positive, is it critical? If it's not, file a ticket, and we'll work on it with our regular work stream of vulnerability scan results. If it is, ring the alarm, and we'll declare an incident.
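To make that ordering concrete, here is a rough sketch of the decision flow in Python. Every helper name (is_suppressed, already_ticketed, and so on) is a hypothetical placeholder for illustration, not our actual implementation.

```python
# Hypothetical sketch of the processing order described above. Every helper
# (is_suppressed, already_ticketed, is_false_positive, is_critical, ...) is a
# placeholder for illustration, not real LaunchDarkly code.

def process_finding(finding):
    if is_suppressed(finding):        # known noisy item: drop it automatically
        return "ignored"
    if already_ticketed(finding):     # already known, not yet fixed: drop it
        return "ignored"
    # a human triages anything that survives the automatic filters
    if is_false_positive(finding):
        write_suppression(finding)    # next time it gets ignored automatically
        return "suppressed"
    if is_critical(finding):
        declare_incident(finding)     # ring the alarm
        return "incident"
    file_ticket(finding)              # regular vulnerability work stream
    return "ticketed"
```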
Our Goals
That's what we wanted to build. We had some goals when we set about building it. First, we wanted it to be operationally low overhead. This meant using AWS services for us. We looked at some open source solutions here, but running someone else's code tends to be a pretty big overhead. We also knew that we wanted the filters to be code based. We thought this would be helpful for making sure that our rules could be really expressive. It would also help us track how we were suppressing various vulnerability scan results so that we could do things like git-blame and figure out why we were suppressing a certain item, with context around that change. Another goal we had was to support our FedRAMP requirements around vulnerability management, so we didn't spend a lot of time in Excel.
Scanning Architecture
We placed this core vulnerability processing pipeline inside of our broader scan, process, and respond framework. On the left is the scanning section, with all of our scanning tools raining down results to ruin our day and our moods. In the middle is the processing section where we standardize and suppress before sending to our inbox. At the center of all this is AWS Security Hub. Then the final section on the right is for responding. This is where our team members get an alert to triage and can quickly get all the information they need to make an informed decision on how to process and spend time on this vulnerability.
We chose several tools to accomplish our scanning goals, the main ones being Amazon Inspector, Trivy, Tenable, and GitHub Dependabot. Inspector is an AWS service that scans our EC2 instances for CVEs and performs CIS benchmark checks. Trivy is an open source scanning tool that's used for scanning some of our container images. Tenable tests our web applications, our APIs, and our database. Then we have GitHub Dependabot, which is looking at out-of-date dependencies in our code that could be exploitable. For these external scanners, we have Lambda code that runs forwarders. It takes the findings out of these external vulnerability scans and imports them into AWS Security Hub. To do that, it has to convert them into the AWS Security Finding Format, or ASFF. It attempts to decorate them with some contextual information that might be helpful.
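The talk doesn't show the forwarder code itself, but a minimal sketch of what such a Lambda might do, assuming boto3 and a hand-rolled mapping into ASFF, looks roughly like this. The field values and the scan_result keys are illustrative assumptions, not LaunchDarkly's actual mapping.

```python
# Minimal sketch of a forwarder: take a finding from an external scanner and
# import it into Security Hub as an ASFF finding. Values are illustrative.
import datetime
import boto3

securityhub = boto3.client("securityhub")

def forward(scan_result, account_id, region):
    now = datetime.datetime.utcnow().isoformat() + "Z"
    finding = {
        "SchemaVersion": "2018-10-08",
        "Id": scan_result["unique_id"],   # must be unique and stable per finding
        "ProductArn": f"arn:aws:securityhub:{region}:{account_id}:product/{account_id}/default",
        "GeneratorId": scan_result["scanner_name"],
        "AwsAccountId": account_id,
        "Types": ["Software and Configuration Checks/Vulnerabilities/CVE"],
        "CreatedAt": now,
        "UpdatedAt": now,
        "Severity": {"Label": scan_result["severity"]},   # e.g. "HIGH"
        "Title": scan_result["title"],
        "Description": scan_result["description"],
        "Resources": [{"Type": "Other", "Id": scan_result["resource_id"]}],
    }
    securityhub.batch_import_findings(Findings=[finding])
```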
Let's take a look at Security Hub and how we use this in our environment. AWS Security Hub is a service designed to aggregate security findings, track the state of them, and visualize them for reporting. It integrates with a bunch of AWS services like Inspector, but also GuardDuty, Access Analyzer, and similar. For us, the killer feature of AWS Security Hub is that it will automatically forward and centralize Amazon Inspector results from different regions and different accounts. This allowed us to just automatically deploy the Amazon Inspector agents on our EC2 hosts, and let AWS handle the routing of those findings into Security Hub.
Processing Architecture
The second section of our pipeline, for processing, is the most critical, and it's where we do the actual crunching of our findings and prepare them for our response. The most important part of this is Suppressor. Suppressor, as its name implies, takes a list of new scanning results and suppresses the noise out of our inbox. It does this by listening to EventBridge. Whenever there's a new finding reported to Security Hub, it runs and makes sure that all of the findings go through a set of rules that recategorize and suppress known false positives. When the Suppressor finishes running, it reports the results to S3, where it's picked up by Panther for alerting. If we dig into one of these Suppressor rules, we can see that they're written in Python. What they look like is pretty simple: two methods. One is check, which returns a Boolean if the rule matches the finding as it comes in. Then action, which returns what we should do for that particular rule. In this rule, what we're looking for is this particular CVE, which doesn't have a patch available from the operating system maintainer. We may have this ticketed somewhere. We essentially don't want to receive an alert about it every time it comes up in a new scan. The ability to write these rules in Python, with templated logic, can be really helpful. It allows us to store our entire rule pack in a GitHub repository. Sometimes this configuration-as-code has some drawbacks, but we have a full CI pipeline where we lint, test, and deploy our rules. That makes sure that any filters we add are hopefully always going to make things more accurate.
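The slide with the actual rule isn't reproduced here, so this is a hedged reconstruction of what a rule with a check and an action method could look like. The class name, the placeholder CVE identifier, and the action constant are all assumptions.

```python
# Hypothetical sketch of a Suppressor rule: `check` matches the incoming
# finding, `action` says what to do with it. Names and values are illustrative.
SUPPRESS = "SUPPRESS"   # placeholder action constant

class SuppressUnpatchableCVE:
    """CVE with no patch available from the OS maintainer; tracked in a ticket."""

    CVE_ID = "CVE-2021-XXXX"   # placeholder identifier, not a real CVE reference

    def check(self, finding: dict) -> bool:
        # Match only findings that mention this specific CVE.
        return self.CVE_ID in finding.get("Title", "")

    def action(self, finding: dict) -> str:
        # We already know about this one; keep it out of the inbox.
        return SUPPRESS
```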
Suppressor isn't the only piece of our processing pipeline; there are some other supporting Lambdas that sit in this section. First is requeue, which makes it so that any time we update our rule sets in GitHub, we automatically requeue all of our findings and forward them back to Suppressor for reevaluation. This makes sure that even if we update rules, what's in Security Hub always matches the state of what we'd expect. We also have asset inventory, which does several things. For the purposes of this diagram, what it does is provide details about our resources so that when we forward data about vulnerabilities, we can annotate it with additional information that will be helpful for responders. Lastly, we also have something called Terminator. What Terminator does is take care of findings associated with resources that have been terminated. We have our SIEM, Panther, which listens to CloudTrail logs and determines when resources are no longer available. It then notifies Terminator, which removes the findings from AWS Security Hub. This can be for EC2 instances, domain names, databases, archived repositories, and so on.
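As a hedged sketch of the "remove findings for terminated resources" step, here is one way a Lambda like Terminator could clear findings through the Security Hub API. The lookup filter and the choice of workflow status are assumptions, not necessarily how Terminator actually does it.

```python
# Hedged sketch: find active findings attached to a terminated resource and
# resolve them so they drop out of the inbox. Filter and status are assumptions.
import boto3

securityhub = boto3.client("securityhub")

def resolve_findings_for_resource(resource_id: str):
    # Look up active findings for the resource that no longer exists.
    paginator = securityhub.get_paginator("get_findings")
    pages = paginator.paginate(Filters={
        "ResourceId": [{"Value": resource_id, "Comparison": "EQUALS"}],
        "RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}],
    })
    identifiers = [
        {"Id": f["Id"], "ProductArn": f["ProductArn"]}
        for page in pages for f in page["Findings"]
    ]
    # BatchUpdateFindings accepts at most 100 finding identifiers per call.
    for i in range(0, len(identifiers), 100):
        securityhub.batch_update_findings(
            FindingIdentifiers=identifiers[i:i + 100],
            Workflow={"Status": "RESOLVED"},
        )
```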
Response
The final piece of our pipeline is the reporting section. Since all findings are reported to AWS Security Hub, we can use its built-in functions for visualization. These findings are then forwarded from Suppressor to our SIEM, which handles routing the actual scan findings to the correct alert destination, and assists us with deduplication of vulnerabilities across groups of hosts or similar resources. This makes sure that whether our finding was discovered on one host or 200, we only get one alert. When everything goes right, our unique, processed findings are sent to Slack from our SIEM, Panther, with enough contextual information for a team member to quickly triage a new scan finding. Inbox Zero is not just about sending you messages; we need to actually process these messages as human beings. For this, we have a rotating role that we created on our team, called the Security Commander, as well as a Slack triage bot.
The Security Commander is responsible for being the human Inbox Zero responder. Their job is to quickly triage, but not fix, any new findings that come in. That means that for the Security Commander, their process flow looks a little bit like this. First, they determine, is this alert a false positive? If it is, write a suppression, upload it, and make sure that that finding is removed from Security Hub. If the finding is legit, then determine, is it critical? If it's not, file a ticket. If it is, then respond and potentially create an incident around this vulnerability. Since most of the time this means that the Security Commander is either writing suppressions or potentially filing tickets, it isn't a particularly high-overhead role, and it allows them to focus on their workday while keeping us focused on having Inbox Zero.
Our Slack triage bot scans the results as they come into Slack from Panther, and makes sure that we are being responsive to all alerts as they come to us. To assist our Security Commander, this Lambda which is shared across all of our security tooling, helps keep us honest by making sure that we respond to alerts and also preparing metrics about the kinds of alerts we’re seeing and how quickly we’re responding to them. It also provides a couple of shortcut actions for the Security Commander for doing things like creating new tickets for vulnerabilities.
Asset Inventory
Inbox Zeroing your way to vulnerabilities being completely addressed is really great. How do you know that you’re actually scanning all of your important resources? We have a couple of Lambdas that look at our infrastructure APIs and code repositories and output the source of truth inventory to S3, as well as information about the resources being scanned. Like for EC2 instances, do they have Inspector running? Or for GitHub repositories, is Dependabot enabled? We additionally have an Endpoint Checker Lambda, and we use this to make sure that all of our domains are scanned to determine whether or not they’re publicly accessible. If they’re publicly accessible, they should be included in our vulnerability scanning. We do this via just a simple port scan.
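A hedged sketch of an endpoint check like this, assuming a plain TCP connect against common web ports; the port list and timeout are illustrative, not our actual configuration.

```python
# Hedged sketch: decide whether an endpoint looks publicly accessible by
# attempting a TCP connection on common web ports. Ports/timeout are assumptions.
import socket

def is_publicly_accessible(hostname: str, ports=(80, 443), timeout=3.0) -> bool:
    for port in ports:
        try:
            with socket.create_connection((hostname, port), timeout=timeout):
                return True          # something answered: include it in scanning
        except OSError:
            continue                 # refused, timed out, or DNS failure
    return False
```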
Lessons Learned
I wanted to share some lessons we learned while setting up this scanning and processing architecture. First, the ASFF format for Security Hub is pretty rigid, and we had to fit some of the external findings into it in a little bit of a distorted way. We found that a little bit challenging. We also found it challenging to make sure that everything had a unique finding ID, especially when resources were similar across environments or accounts. We've also been weighing the tradeoffs between Inspector V1 and Inspector V2. Inspector V1 doesn't work in all regions that we want it to. It doesn't have the same integration with Security Hub. It also requires its own separate agent on EC2. The big tradeoff is that V2 currently doesn't support CIS benchmarks. Another thing we learned is that our underlying operating system on EC2, Ubuntu, still requires restarts even with unattended updates. What we found is that tracking reverse uptime, and making sure that old hosts get rebooted pretty frequently, is important. That also means making sure that all of our infrastructure supports being rebooted and restarted. Finally, we found that Security Hub has some relatively tight rate limits. You can see here that there are some APIs which can rate-limit you relatively quickly, and so we had to rearchitect some of our pipeline to support this.
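The talk doesn't detail the rearchitecture, but a common way to live within those limits is to batch the Security Hub calls and lean on botocore's built-in retry handling. A minimal sketch, with the batch size and retry settings as assumptions:

```python
# Hedged sketch: batch the imports and let botocore's adaptive retry mode
# absorb Security Hub throttling. Batch size and retry settings are assumptions.
import boto3
from botocore.config import Config

securityhub = boto3.client(
    "securityhub",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

def import_in_batches(findings, batch_size=100):
    # BatchImportFindings accepts at most 100 findings per call.
    for i in range(0, len(findings), batch_size):
        securityhub.batch_import_findings(Findings=findings[i:i + batch_size])
```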
FedRAMP POAM (Plan of Action and Milestones) Automation
I wanted to share another benefit of having a single source of truth for this vulnerability data. This one's for the FedRAMP heads out there. We built a few Lambdas to automatically generate our monthly continuous monitoring reports. The asset inventory Lambdas go in and generate a list of cloud resources and their compliance with some of our security controls, things like disk encryption, running security agents, and so on. We then query Security Hub to ensure that all vulnerabilities in it map to vulnerabilities that are documented in Jira and associated with what are known as POAMs, or Plans of Action and Milestones. We can then automatically generate the Excel spreadsheets that we need to provide to the federal government to show that we're ready to handle federal data.
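As a hedged illustration of that reporting step, here's a minimal sketch that pulls active findings from Security Hub and writes them into a spreadsheet with openpyxl. The column layout is not the real FedRAMP POAM template, and lookup_jira_poam is a hypothetical stand-in for the Jira mapping.

```python
# Hedged sketch of the monthly report step: pull active Security Hub findings
# and write a POAM-style spreadsheet. Columns and helpers are illustrative.
import boto3
from openpyxl import Workbook

securityhub = boto3.client("securityhub")

def lookup_jira_poam(finding_id: str) -> str:
    # Hypothetical helper: map a finding ID to its Jira POAM ticket key.
    # A real implementation would query Jira; this stub just returns a blank.
    return ""

def export_poam(path="poam.xlsx"):
    wb = Workbook()
    ws = wb.active
    ws.append(["Finding ID", "Title", "Severity", "Resource", "Jira POAM"])

    paginator = securityhub.get_paginator("get_findings")
    pages = paginator.paginate(Filters={
        "RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}],
    })
    for page in pages:
        for f in page["Findings"]:
            ws.append([
                f["Id"],
                f.get("Title", ""),
                f.get("Severity", {}).get("Label", ""),
                f["Resources"][0]["Id"] if f.get("Resources") else "",
                lookup_jira_poam(f["Id"]),
            ])
    wb.save(path)
```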
What’s Next?
Looking forward, we're excited to make some improvements to this pipeline. First, we want to upgrade to Inspector V2 once it supports CIS benchmarks, which we're hoping comes soon. We're also looking to add more scanners. I think this is going to be an opportunity for us to take advantage of this filtering pipeline to really make sure that when we do add scanners, we're getting value from them. We're hoping to be able to expire suppressions regularly so that they need to be revisited to ensure that they're still appropriate. We also want to be able to delegate findings to teams. On our security team, we do go in and fix vulnerabilities, but we want to also have the option to send them out to teams to parallelize our effectiveness. It would also be great to have team scorecards, where we could incentivize teams to go in and update their own infrastructure that has vulnerabilities that are close to being out of SLO. Finally, we want to take a lot of this data and put it into Snowflake or a similar data warehouse so that we could really slice and dice it and look at it with data visualization tooling. That's an area that I'm excited for us to work on together.
Summary
Love them or hate them, vulnerability scanners aren’t going anywhere. We recommend that you embrace the avalanche and think in shovels not in teaspoons. Quit lying to yourself about when you’ll actually clean out your inbox. I hope that you’ll all join us in the practice of Inbox Zeroing your way to vulnerability scan tranquility.
Questions and Answers
Knecht: Thinking back and retrospecting, what were the biggest challenges in creating this flow at LaunchDarkly? What were maybe roadblocks you ran into as you were rolling it out?
Smolen: I think one of the biggest challenges was getting consensus as a team about what we were trying to do. I think we all recognized that we had problems, and different people, depending on their role, had a different perception of what the problem was or how to solve it. Figuring out what we wanted the end state to look like was a really big step towards arriving at the solution that we did. That's why I think vulnerability Inbox Zero, even though it's maybe a stretch as a metaphor, was really helpful for us to all get agreement on what the big picture looked like: where we were all heading, the North Star. That was a challenge, though, to figure that out. I think another set of challenges was related to really getting that source of truth to be high quality. When you're dealing with data, the challenges are often not just in getting the data into a centralized place, but in really making sure that that data is of high quality. Getting all the vulnerability data into Security Hub, or wherever else you might want to put vulnerability data, certainly takes some effort. The challenges were really around making sure that it was up to date, that it had all the information that we needed, things like that.
Knecht: That makes sense, especially when you might be having scanners that find the same thing with slightly different words, like deduping, all of those things, definitely. It seems like a hard problem to solve.
One of the things you talked a lot about was Suppressors and not actioning things as the action plan for those vulnerabilities. What do you think about the risk of accidentally filtering out a critical true positive vulnerability, or how did you think about that or weigh that as you were designing the system?
Smolen: I think going back to what you were just talking about with respect to having scanners that are somewhat duplicative, I do think that there's layers of defense in a security program where you identify vulnerabilities along a timeline. If a vulnerability never gets identified, then, who cares? If a vulnerability comes and goes, and no one ever exploits it, then it's the proverbial tree falling in the forest. You can discover a vulnerability along a timeline, hopefully not as part of a postmortem of an incident. That's the scenario you want to avoid. How do you catch it earlier? Obviously, scanners are going to be one way to do that. There may be a world where you identify the issue through a bug bounty, or through some manual penetration testing or something along those lines. In our security program, we do have those multiple layers of defense in place.
I also think that there’s a huge amount of value from what I may call threat intelligence, which in reality can sometimes be like, knowing other people who work in security teams, and hearing like, we’re hearing rumors that people are exploiting this vulnerability. Certainly there is, I think, a huge value in the collective knowledge that we share in the security community, as to whether or not a vulnerability is something that we really need to triple check that we haven’t missed. Because the reality is that the overwhelming majority of vulnerability scanner results do not refer to things that would be exploited. We would all be in trouble if they were.
Knecht: You talked about some of the future shape of what this is going to look like at LaunchDarkly. Has some of that stuff been realized? Have you had any additional or different thoughts about how that might evolve, or does that pretty much look the same as what you mentioned there?
Smolen: I mentioned this, but I think something that is really a focus for us right now is thinking about the data with respect to this pipeline, and how it relates to our asset inventory, and recognizing that as a security team, our decision making can only be as good as our data in some cases. We are running into, I think, a problem that a lot of software engineering teams run into, which is that getting good data is almost an entirely separate discipline from software engineering. You see data engineering teams that specialize in capturing data and providing the analysis and visualization tools that enable the business to make good decisions. We are needing to build a similar capability internally for our security team. We're learning a lot of lessons from the data engineering discipline to help us think about how we run our security program.
Knecht: We have similar stuff going on at Netflix actually, thinking about, what are the practices of data engineers, and quality checks, and all these things that help us to have trustworthy data. I think that’s a hard one to solve, but also an important one.