Article originally posted on InfoQ.
Transcript
Gupta: Around five-and-a-half years back, my husband and I were expecting our first kid. All we could think of in that moment was, how are we going to bring this baby into the world: the day of the birth, and how this baby is going to have such a huge impact on our lives? We knew that raising kids was hard, but we greatly underestimated how hard. Over the last five years, I realized that as my kid grew, there were new challenges every day and we had to constantly adapt our parenting toolkit to raise them into a safe and well-functioning adult. Why am I talking about my kids in a technical presentation? It's because over the last five years of raising my kid and building different services, what I have observed is that raising kids is a lot like operating a service under an evolving threat landscape. These services, just like kids, operate in a constantly changing and unpredictable environment. There are always new threats. With kids, they're either trying to poke their fingers into electrical sockets, putting everything in their mouth, or running up the stairs. The threats are constantly evolving, and our approach to keeping them safe also needs to evolve. The effort of raising a child and operating such services is often underestimated. Both of these operate in a constantly changing and unpredictable environment. The kind of services that I'm referring to here are things like DDoS prevention and fraud prevention, where your threat landscape is constantly evolving. These are the sorts of services that I work on at Netflix.
Background
I am a Staff Security Software Engineer on the customer trust team at Netflix. On our team, we build scalable defense systems, with the goal of stopping any malicious activity that can negatively impact or harm customer trust. Some examples of such attacks are when fraudsters try to use stolen credentials to do unauthorized logins or account takeovers. I specifically lead our anti-DDoS efforts at Netflix. I love to build systems to solve security problems. I'm also a mom of two small boys, and what I have realized in raising these two small humans and building different services is that what works now may not work later. Our security solutions may lose their effectiveness as attacks evolve. With time, as threats evolve, our security systems must evolve too. While working on these services at Netflix, we experience this constantly changing nature of attacks quite often.
The Operational Burden
In this graph, the x-axis is time, and the y-axis is the detection rate. Until the point where we launch the service, the detection rate is zero. Once we launch it, it goes up, and we are happy. We are doing great. Over time it saturates. Then the attackers notice and they evolve. When they evolve, our detection rate goes down. Then we're like, what's happening? We modify our defenses and the rate goes up. Then the attackers respond by changing their techniques. This is a constant rollercoaster. What's important to note here is that if we don't evolve, our detection rate is going to go down. What does this mean for us as service owners? The operational burden increases. We constantly find ourselves in firefighting mode. What do I mean by operations here? Operations stem from two things. Either we are not blocking enough bad traffic, which leads to an increase in fraud and account takeovers, or a decrease in service availability, for example with DDoS. Or, on the other end of the spectrum, we start blocking too aggressively, which leads to blocking good user traffic and manifests in the form of increased customer service calls. You know that your operational burden is out of control when you find yourself making statements like, “I'm constantly in firefighting mode. Every time I sit down to code, I get interrupted by a page. There's too much unplanned operations work this quarter. I don't know if we are seeing more attacks, or if my service has an issue. I have no time to do feature development. I'm not quite sure what happened with this attack.” I believe I'm not the only one here who has found themselves in this situation. I'm sure there are more of you out there.
Example: Challenges at Netflix
While working on the DDoS service at Netflix, this is where I found myself: the operational overhead was going up, and I was constantly in firefighting mode. Looking at the service, I realized that our challenges fell into three main categories. The first one was visibility. At that point, I realized we didn't really have a good understanding of our DDoS landscape. We knew what was happening in the moment, how the service was performing at that moment. But if we looked at the entire threat landscape, how many DDoS attacks were we seeing, and how were they related? We didn't have an idea. How was the service responding to these over a long period of time? Were we doing better? Were we doing worse? We didn't know. The second one was investigations. When our service did not do a good job of blocking the DDoS traffic, why was that? I'm a backend engineer and my data skills are not that sharp. It took me hours to write Hive queries and look at the data. The tables were extremely big, so the queries were very slow. It took several hours to even figure out what was going on. I ended up writing queries which were wrong, so I had to rewrite them. This was a big time sink. Then, the third one is operations, which is basically that we didn't have an idea of the false positive or false negative rate we were seeing with our service. Did we need to fine-tune our defenses? How did we need to fine-tune them?
What Are the Options?
Faced with all of this, I thought to myself, what are our options now? The first one is, either I figure out how to scale operations, which means we hire more folks. That's not always an option given resourcing budgets and everything. Or we can build tooling to solve this. The second option is, my service is not doing great, so let me just build a new one. While that may be the right solution in some cases, what I have observed is that in most cases, when you build a new version, we end up repeating the same mistakes. It's very hard to get feature parity, and it becomes a vicious cycle. What I have realized is that systems don't change, mechanisms change. The core implementation of your service is not what leads it to become ineffective over time. It's the mechanisms, the rules, the models that you integrate with. This is the key point of this presentation. If we have to scale these defenses in a threat landscape that is constantly evolving, we need to figure out a way to scale operations. What are my options here? At Netflix, we think beyond the core business logic. We think of it as an entire defense ecosystem. What do I mean by that? A defense ecosystem is a continuously evolving ecosystem. It observes the attack behavior and how we are responding to it. It learns the new attack patterns, and then it responds. It adapts by modifying and deploying new rules and models against these constantly evolving threats.
The Defense Ecosystem – Decisioning Service
How do we build this defense ecosystem? I'm going to talk about the four components of the defense ecosystem. The first component is the decisioning service. This is the service that contains the core business logic used to decide whether a request is fraudulent or not. This is what your important flows, like the login flow, integrate with. The operational challenges from this decisioning service manifest in four forms: either we are blocking good users, we are not blocking enough bad traffic, our decisioning is slow, or the availability is low. When we are building our decisioning service, we have to design it to reduce operational overhead.
When we do this, we build the service with the following key objectives. We want a service that has a low false positive rate and a high detection rate. When either of these is not working as intended, we need to design the service in a way that lets us respond fast. We want the service to have low latency so that it can do faster decisioning. We want it to be highly available. Just consider the DDoS defense service: if that service is not highly available, how can it guarantee the availability of your entire infrastructure? These are our five key objectives. It's easier said than done to get all of these right, because some of them are inherently tradeoffs. If I want something that's highly accurate, then I need to spend more time looking at the request and running a whole bunch of machine learning models, and that would lead to high latency. What we have realized is that building all of this into just one service is often not the right solution, so we look at it from a layered defense approach, where we break our service down into various components, and we deploy them at various layers in our infrastructure.
Let's look at an example: login attacks. There are three login attacks that I'll talk about. The first is credential stuffing, which is high-volume login traffic where someone gets hold of a stolen credential dump, and they're just spraying the infrastructure with millions of requests. If they get a few of them right, that's all they want. The second, unauthorized login, is when someone has the username and password of an account that they don't own and they are using it in an unauthorized manner. The third one, account takeover, is a step further. That's when they not only log into your account, but they also kick you out by changing the email and password, so now they basically own the account. When you're building defenses against login attacks like these, you can think of the defenses as a funnel. At the top level is the edge. That's where requests enter our infrastructure. This is where we deploy a solution that's low latency. It doesn't have to be 100% accurate; as long as it blocks 70% of the traffic with a super low false positive rate, we are good. As we go down in our infrastructure, we now have more leniency in terms of latency, but we also have higher requirements in terms of accuracy, because now we have more context. We have opened up the request. We have a whole bunch of different information about this request. Now our decisioning can be much better. What's important to note here is that when we are looking at this layered defense, we have different contexts at different layers. We have different requirements in terms of latency and accuracy. We want to block as early as possible as a request comes into the infrastructure, but as we go down the funnel, we have more context for a better decision.
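To make the layered funnel more concrete, here is a minimal sketch in Python of the idea, with hypothetical names and thresholds rather than Netflix's actual implementation: a cheap edge check runs first on coarse signals, and a slower downstream check uses the richer context that only exists deeper in the stack.

```python
from dataclasses import dataclass, field

@dataclass
class LoginRequest:
    ip: str
    # Deeper layers see richer context (payload, device history) than the edge does.
    payload: dict = field(default_factory=dict)

def edge_check(req: LoginRequest, ip_risk_scores: dict) -> bool:
    """Edge layer: must be very low latency, so it only uses a cheap,
    precomputed signal. It aims for a very low false positive rate and
    is allowed to miss some bad traffic."""
    return ip_risk_scores.get(req.ip, 0) >= 9  # block only the obviously bad

def application_check(req: LoginRequest) -> bool:
    """Application layer: a larger latency budget, so it can open up the
    request and combine more signals for a more accurate decision
    (illustrative logic only)."""
    failed_logins = req.payload.get("recent_failed_logins", 0)
    new_device = req.payload.get("device_age_days", 999) < 1
    return failed_logins > 10 and new_device

def decide(req: LoginRequest, ip_risk_scores: dict) -> str:
    # Block as early in the funnel as possible; otherwise fall through
    # to the layer with more context.
    if edge_check(req, ip_risk_scores):
        return "block@edge"
    if application_check(req):
        return "block@application"
    return "allow"
```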
The second one is the rules and configurations that we need for our service. Where do we store them? Because this will define how the operations show up later. One option is to keep this in our code. Here’s a hypothetical example. We are trying to figure out if a request is from a bot. The request comes to a login endpoint. We take the IP address. We see how risky this IP address is. If it’s more than a certain limit, which is 5 here, then we say that, ok, block it. The advantage of having this in code is that we can have much more complex logic, but it also means that the response is going to be slower. Every time we need to make any changes, we have to modify the code, do a new build, and deploy it. That takes time. The second option is, have this as a static configuration. This is slightly better in terms of response. For example, if you need to change the minimum risk score from 5 to 7, you just need to go and make a change here and just redeploy the service. Now it’s much easier to keep all the configurations in one place, and you don’t need to make code changes.
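As a rough illustration of those first two options, here is a hedged sketch (the request fields, the risk-score service, and the config file layout are all hypothetical) showing the same rule once hard-coded and once read from a static configuration file that ships with the service:

```python
import json

# Option 1: the rule lives in code. Changing the threshold means a code
# change, a new build, and a deploy.
MIN_RISK_SCORE = 5

def is_bot_in_code(request, ip_risk_service) -> bool:
    if request.path == "/login":
        return ip_risk_service.score(request.ip) > MIN_RISK_SCORE
    return False

# Option 2: the rule reads a static configuration file packaged with the
# service, e.g. {"login": {"min_risk_score": 5}}. Changing 5 to 7 is a
# config edit plus a redeploy, with no code change.
with open("defense_config.json") as f:
    CONFIG = json.load(f)

def is_bot_static_config(request, ip_risk_service) -> bool:
    threshold = CONFIG["login"]["min_risk_score"]
    return request.path == "/login" and ip_risk_service.score(request.ip) > threshold
```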
The third option is using dynamic configuration. This could be in the form of key-value pairs. Here, when you make a change, it is instantaneous. The moment you make a change, your service picks it up and your response is super fast. At the same time, too many key-value pairs can lead to a config explosion. You start to lose visibility: what configurations do you have in place? Are they conflicting? If there are multiple of them, how do you make sure which one takes precedence? Dynamic configuration doesn't necessarily have to be key-value pairs; you could also think of it as a JSON or YAML file, so that you have better visibility into what's going on. That can get loaded into your system frequently. In our services at Netflix, we use a combination of all three of these, based on how new and how mature the service is. If it's still in the early stages, we start with putting things in code, or as Fast Properties, but as things mature, we try to pull this out into a configuration file.
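A minimal sketch of the dynamic option, assuming a hypothetical YAML file and a simple periodic reload (a real system would typically use a dynamic property service rather than polling a file):

```python
import threading
import yaml  # assumes PyYAML is available

CONFIG_PATH = "/etc/defense/rules.yaml"          # hypothetical path
_config = {"login": {"min_risk_score": 5}}       # safe defaults

def _reload_config(interval_seconds: int = 30) -> None:
    """Re-read the config file on a timer so changes take effect
    without a rebuild or redeploy."""
    global _config
    try:
        with open(CONFIG_PATH) as f:
            _config = yaml.safe_load(f) or _config
    except OSError:
        pass  # keep the last known-good config if the file is unreadable
    timer = threading.Timer(interval_seconds, _reload_config, [interval_seconds])
    timer.daemon = True  # don't keep the process alive just for reloads
    timer.start()

_reload_config()

def min_risk_score(flow: str) -> int:
    # Callers always see the most recently loaded value for their flow.
    return _config.get(flow, {}).get("min_risk_score", 5)
```

Keeping everything in one structured file like this also preserves the visibility that a pile of individual key-value pairs tends to lose.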
For these, when we are looking at the operational considerations, there are certain things we need to keep in mind. The first is that we should be able to respond fast. With some options, like dynamic configuration, we can respond much faster, compared to code, where it takes longer to respond. We need our configuration storage to be highly available, because our decisioning service can't really function without it. We also need to think about visibility, which is what I was talking about with the key-value pairs: if you have too many of them, you really don't know what's going on. Are they conflicting? Which one is actually taking effect? Having that visibility into the entirety of your configuration is also important. The fourth one is self-serve. That may not be applicable to all fraud services, but for some of them what we have realized is that a fraud service is used not just by one flow, but by multiple different flows, which are owned by different teams. In terms of operations, if we are the only ones who can make these changes, it's slow, and it's a lot of operational overhead on our end. By making our configuration self-serve, we give different teams the ability to make changes for their own flows.
The third thing I wanted to talk about was dependencies. To reduce the operational overhead and be able to scale our defenses, it's important to reduce our dependencies, because the SLA of our service is going to be defined by the SLAs of what we depend on. It might be tempting to use a shiny new technology that's out there, but we have to stop and think about how it is going to affect the SLA of our service. We have to reduce fanout. The paved path may not always be the right solution for such services. Let's look at an example. Say the DDoS service uses Kafka streams to generate its rules. If there's a DDoS happening, and that brings down the Kafka service, then your service will not be able to get any new rules, which means it will not be able to stop the DDoS attack, and that renders the service useless. It's important to reduce fanout. It's important to look at the dependencies of these services to make sure that they don't go down as the threat landscape evolves and changes.
The Defense Ecosystem – Intelligence Component
The second component is the intelligence component. This is basically what helps us identify what is bad and what to block. This is the brains of your system: the rules, the models, the feature stores, the tables. This is the secret sauce, so I can't go too deep into it. It also depends on the service, because we have various different kinds of services. The key thing to note here is that the intelligence becomes stale with time. Think of antivirus software: if you don't update the signatures, that software is effectively useless. It's similar here; you need to update the intelligence or the systems will stop being effective. We do this mostly manually now, by revisiting our rules and configurations every so often. We are also working on developing EDS, which will allow us to semi-automatically look at what's working and what's not working, and then respond to it, instead of doing it manually all the time.
The Defense Ecosystem – Observability
The third component is the observability component. This is one of my favorite ones. Earlier this year, we saw a significant increase in the number of DDoS attacks that were causing SPS impact. SPS is streams per second, and it is the key metric that we use to measure the availability of Netflix; there is a very interesting Netflix engineering blog post on this as well. Around the same time, Russia invaded Ukraine, and there was also an increase in DDoS activity around the world. The question that our leadership was asking us was, is there a correlation between this recent increase in DDoS at Netflix and the Ukraine-Russia conflict? Honestly, we didn't know the answer to that question. The reason was that we did not have long-term visibility into the DDoS attack trends. We only looked back a few weeks. Beyond that, we didn't have the data, so we couldn't really answer the question. We decided to go ahead and solve this problem, because this is important to know. That's when we started investing in an observability system. Let's dive deep into what this meant.
First, when we are looking at understanding the threat landscape, there is micro-visibility, which is where you take an attack and zoom in on it. You see how that attack is behaving and how we are responding to it, and you build metrics around that. Let's look at a DDoS example. The blue line here shows a DDoS attack. There's baseline traffic, there's a huge spike, and then, after a while, the attack goes down. The red line is the blocked traffic. As you see, at first we are really not blocking much. Then as the DDoS begins, we slowly start ramping up. After a point, we are blocking all of it, and that's when the DDoS attack stops and goes away. When we were building metrics for this, the first challenge that we ran into was, how do we even identify the attack in the data? When we are looking at the data, we are looking at the request level. We don't know if a given request is a DDoS request or a normal request. What we did was develop a spike detection algorithm. We looked at the service traffic, and whenever there was a sharp increase followed by a sharp decrease, we said, ok, this is when a DDoS attack starts, this is when it stops, and this is one DDoS. Once we had that, we could take those values and figure out what was going on with that particular DDoS. Then we looked at: how long was this attack? What was the volume of the traffic? What was the spread of the key features that we use to block a DDoS? Then we also looked at the response. What was our block rate? How long was our response time? Was there any service impact? If yes, how much? This is the micro-visibility into a DDoS attack.
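As a simplified sketch of that spike detection idea (not the actual Netflix algorithm): walk a requests-per-minute series, open an attack window when traffic jumps well above a rolling baseline, and close it when traffic falls back near that baseline.

```python
from statistics import median

def find_attack_windows(rpm, baseline_window=30, spike_factor=3.0):
    """rpm: request counts per minute. Returns (start, end) index pairs,
    one per detected attack. Thresholds here are illustrative."""
    windows, start, frozen_baseline = [], None, None
    for i, value in enumerate(rpm):
        if start is None:
            history = rpm[max(0, i - baseline_window):i] or [value]
            baseline = median(history)
            if baseline > 0 and value > spike_factor * baseline:
                start, frozen_baseline = i, baseline   # sharp increase: attack begins
        elif value <= 1.2 * frozen_baseline:
            windows.append((start, i))                 # back near baseline: attack ends
            start, frozen_baseline = None, None
    if start is not None:
        windows.append((start, len(rpm) - 1))          # attack still ongoing at end of data
    return windows

# e.g. find_attack_windows([100] * 40 + [900] * 10 + [100] * 10) -> [(40, 50)]
```

Once each window is identified, the per-attack metrics described above (duration, volume, spread of key features, block rate, response time, service impact) can be computed over just the requests inside that window.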
The second one is macro-visibility. Now if I zoom out and look at all these different DDoS attacks over a much longer period of time, what is it that I'm seeing? As an example, we built out metrics to figure out how many DDoS attacks we see daily. When we were building out the metrics, we broke them into two components. The first was high-level metrics. High-level metrics are useful not just to us, but also to leadership, to understand what is going on in our infrastructure in terms of DDoS. These show what the attack trends are. Which endpoints are hit most often by DDoS? How are we responding? What's the cost impact, and so on? The second kind is operational metrics. These are used by us as service owners and operators to understand how our different services are doing. How well are we responding to these DDoS attacks? Which rules and models are the most effective, and which are the least effective? If we have to look at multiple attacks and dive deep, what metrics do we need? This is what we built in terms of macro-visibility for the DDoS landscape.
This was a lot. When we started off, we basically had nothing. We just had an idea: yes, we want to improve the visibility of our system, we want to understand things better, we want to improve our investigations. The question was, where do we even start? We started with the key questions: what are the key questions that we need answers to? We listed all of them out. Next, we asked what data we needed to answer these questions. Did we have this data already, or did we need to build data pipelines? Where we didn't have the data, we went ahead and built the data pipelines and workflows, so that we had all the data we needed to answer the key questions. As the final step, we built dashboards, so that this data was easily accessible for us to consume.
The Defense Ecosystem – Response
Let's move on to the final and fourth component of the defense ecosystem, which is the response component. There are two kinds of response. The first is the proactive response. This is when we are doing continuous monitoring of our system to see how it's performing. What do the false positives look like? What do the false negatives look like? Which rules are effective and which are not? Then we respond by tuning these rules and models and updating our defenses. The second is the reactive response. This is when things don't go as expected: there is an alert, there's an escalation, and we need to investigate it. One of the things that we use is Elasticsearch, but sometimes that doesn't have all the context. Like I mentioned earlier, when we had to do investigations, I spent hours writing Hive queries to pull data from a table which was huge, so it was super slow. Instead, what we did was build notebooks. In these notebooks, I can now specify: this is when the attack happened, this is when it started, this is when it ended, this is the target service that it hit. Now, tell me, why didn't I block enough? The notebook would then run, look at the traffic that we did not block, and break it down by the various reasons why we didn't block it. Maybe we were still in a preview mode, where we were observing what was going on but not quite blocking it. Or maybe we were just ramping up and didn't quite get there in the first few minutes. There can be different reasons why we wouldn't block something, and this notebook now showed that to us. An investigation which used to take us 6 to 8 hours would now take 15 to 30 minutes. That is a huge efficiency saving in terms of developer time. Some of the early wins that we saw when we invested in building these observability and response systems were that we now had much better visibility into what was going on in our infrastructure, and our investigation and response was much faster. As a side effect of building the data foundations needed for the visibility metrics, we also ended up with a data foundation that would help us improve our services' effectiveness and lead to much better signal generation.
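A rough sketch of what such an investigation notebook might compute, with hypothetical column names: take the requests in the attack window against the target service that we did not block, and break them down by the reason they were allowed through.

```python
import pandas as pd

def explain_unblocked(requests: pd.DataFrame, attack_start, attack_end, target_service):
    """requests is assumed to have timestamp, service, action, and
    allow_reason columns (e.g. 'rule_in_preview_mode', 'ramp_up_delay',
    'no_matching_rule')."""
    in_window = requests[
        (requests["timestamp"] >= attack_start)
        & (requests["timestamp"] <= attack_end)
        & (requests["service"] == target_service)
    ]
    unblocked = in_window[in_window["action"] != "block"]
    # Count unblocked requests per reason, largest contributor first.
    return (
        unblocked.groupby("allow_reason")
        .size()
        .sort_values(ascending=False)
        .rename("unblocked_requests")
    )
```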
Caveats
The first caveat is, when should we build these various components of the defense ecosystem? If I go back to the kid analogy: to raise a kid, you need an ecosystem. You need the school, the pediatrician, the enrichment classes. You don't start thinking about all of that from day zero; you wait till your kid is a bit older and the timing is right, and that's when you start investing in such things. The same is true for a defense ecosystem. When you're just building out the service, you invest in the decisioning service and a bit of the intelligence. As the service matures and evolves, the operational overhead starts to increase, and eventually that's when you also start building the observability and response systems to make things much better. The second learning that we had was that it is important to staff the team right. When we started off with the service, we just had backend engineers working on it, with some support from data analytics, but it was still mostly backend engineers. That's why we focused so heavily on the decisioning service part of it, but not so much on observability and response. When the need for those arose, we also brought on data engineers and data analysts, and improved our operations. Once we did that, we saw a significant improvement in how we operated as a team and in how our different services were doing.
Key Takeaways
Let's review. If there's just one thing that I would like you to take away from this talk, it's that it's important to think beyond the core business logic. It's important to invest in the entire defense ecosystem. What I mean by that is, it's important not just to have the decisioning service, but also the intelligence component, the observability component, and the response component. Yes, think beyond the core business logic. Think about the entire defense ecosystem.
Questions and Answers
Knecht: Your talk is an awesome look at how you're solving this problem at Netflix. Can you talk a little bit about why distributed denial of service, or DDoS, is a particularly difficult problem to solve?
Gupta: With DDoS, what happens is we get a huge number of requests in a very short amount of time. There are two things: first, we have to respond very fast, because if we take our own sweet time to respond, the damage is done. Every second is important. We have to respond very fast. The second thing is we have to respond with very limited context. When you have to respond fast, you don't have the luxury of running a lot of machine learning models, opening up the request, or looking at the payload. You operate with very little information. With very little information, we have to respond fast, and we have to make sure that we don't have false positives, because when you're operating with less information, you tend to be very aggressive and that can lead to blocking good users. We don't want that. Trying to make sure that we respond fast with limited context, without false positives, is the main challenge I feel in defending against DDoS attacks.
Knecht: You mentioned false positives and false negatives, I think you have the other side of that too. Can you talk about how you measure false negatives of your system or metrics, or things that you use to measure how well your system is doing in blocking?
Gupta: In one of the previous slides, I showed what a DDoS attack looks like. We have a baseline, which is what we usually see in terms of traffic. When there's a DDoS, there's a spike that lasts for a while, and then it goes away. When we look at our defenses, we also look at metrics for how much traffic we blocked, and that follows a similar trend: there's a baseline, then when there's a DDoS we start blocking a lot more traffic, and once the DDoS goes away, we go back to normal. The spike above the traffic baseline is the net volume of the DDoS we see, and the extra traffic we block during the spike is what we blocked. The difference between the two is basically what we did not block; that is the false negative. That's one metric that we use to know whether we are not blocking enough bad traffic. We also have alerts on service availability, so if there's a DDoS and we are not able to respond fast enough, then we see a service impact and we get paged. Those are the two things that we look at.
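Expressed as back-of-the-envelope arithmetic (a simplified illustration, not the production metric): the attack volume is the spike above the traffic baseline, and the false negatives are that volume minus the extra traffic blocked during the spike.

```python
def estimated_false_negatives(traffic_during_spike, traffic_baseline,
                              blocked_during_spike, blocked_baseline):
    # Attack volume = traffic above the normal baseline during the spike.
    attack_volume = max(traffic_during_spike - traffic_baseline, 0)
    # What we actually stopped = blocks above the normal block baseline.
    extra_blocked = max(blocked_during_spike - blocked_baseline, 0)
    return max(attack_volume - extra_blocked, 0)

# Example with made-up numbers: 1.0M rps during the spike vs a 0.2M
# baseline, while blocking 0.7M rps vs a 0.05M blocked baseline
# -> roughly 0.15M rps of attack traffic was not blocked.
missed = estimated_false_negatives(1_000_000, 200_000, 700_000, 50_000)
```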
Knecht: I think DDoS and protecting a service that’s as popular as Netflix probably has a lot of ambiguity in it, or certainly does have a lot of ambiguity in it. You might not hear about something like we accidentally blocked a user or something like this. How do you measure false positive rates in systems where it is difficult to identify the ground truth data?
Gupta: I think for those, we look at two things. The first is the expected traffic pattern. With Netflix, we know there's an expected shape that the traffic is supposed to follow. If we deploy a new configuration, look at just the blocks by that configuration, and see that those blocks also follow the same trend, then that configuration is most likely blocking good users. That's one of the signals that we use. The second is that if something goes wrong and we are blocking good users, we get contacted by customer service. We work very closely with our customer service folks. If we see that we deployed a configuration and calls to customer service went up, that's a strong indicator for us: “Ok, something wasn't right, this is what we deployed. Let's turn it off and investigate.”
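One way to picture that first signal (illustrative only, with an arbitrary threshold): if the time series of blocks attributed to a newly deployed configuration correlates strongly with the expected organic traffic pattern, the configuration is probably catching good users, since attack traffic rarely follows the normal diurnal shape.

```python
from statistics import correlation  # Python 3.10+

def looks_like_blocking_good_users(blocks_by_new_config, expected_traffic,
                                   threshold=0.8):
    """Both arguments are equal-length series over the same time buckets;
    the 0.8 threshold is an illustrative choice, not a tuned value."""
    if len(blocks_by_new_config) != len(expected_traffic):
        raise ValueError("series must cover the same time buckets")
    return correlation(blocks_by_new_config, expected_traffic) > threshold
```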