Transcript
Lopes: Let’s play out a scenario. Imagine yourself going to an online streaming platform of your choice, be it BBC iPlayer, Netflix, Disney, or any other. Here’s the catch: imagine that none of these platforms support user personalization. What does that mean? It means that when you log into any of these platforms, you need to remember the list of programs that you’ve been watching. I don’t know about you, but I tend to have at least two or three programs in play at any given time on any of these platforms. Many of these programs are available only on one particular platform.
Even if you had a photographic memory and could remember the exact mapping of program to platform, what about the watching progress? Would you really remember that? Imagine how frustrating that would be. Now imagine there’s another online streaming platform in the market which remembers the list of programs that you’ve been watching, along with your watching progress.
Which platform would you prefer? It would be a no-brainer for me to go to the new one. In this simple scenario, we have seen how important personalization is to all of us. No industry has remained immune to it, be it groceries, drugstore websites, travel, leisure, you name it. The very fact that all of you are sitting here in this audience, again goes to emphasize how important personalization is to us.
Background, and Outline
I’m Manisha Lopes. I’m a Principal Software Engineer at the BBC. I’ll be talking about providing a personalized experience to millions of users at the BBC. First, we’ll look at what personalization is. Here, I will introduce you to some of the BBC products, which will help you get the context for the talk.
Next, we will look at the architecture and the integration of the system, introducing the different services that we make use of in our personalization product. We will then look at the different steps that we’ve taken to improve the performance and the capacity of our personalization product. We will then look at the tooling that has enabled and helped us through this journey. Finally, we’ll talk about how to manage your stakeholder expectations.
What Is Personalization?
What is personalization? Personalization is a process that enables businesses to tailor customer journeys and experiences to the needs and preferences of each customer. Businesses and products that provide a personalized experience help save the time and effort that the customer spends in finding what they want, thereby improving the customer’s experience of, and satisfaction with, that product.
Here we have some of the BBC products listed. Most of you might be familiar with BBC News, which is quite popular. The BBC has other products like iPlayer, our streaming service for video, where you can watch BBC programs on demand as well as live events. Sounds is the streaming service for radio, where you can also listen to podcasts. We have Sports, Weather, and Bitesize.
We have other products like CBeebies, which is dedicated to children. The BBC is a public service, and whatever we do has to benefit you, our end users. To provide more of what you love, we need to understand you better. To do that, we ask our users to sign in to some of our products and services.
The data that you share with us is never sold or shared with any third-party product. It is used to drive a more personal BBC for you, be it recommending programs, or showing you content that is relevant to you. As the BBC, and as a public service, we need to ensure that we make something for everyone, thus making a better BBC for everyone. If you’d like to read more about the BBC privacy promise, I have a link provided.
User activity service, also called UAS, is a highly performant, real-time service that remembers a person’s activity while they’re making use of the BBC. It allows for the collection of a user’s activities and the retrieval of those activities to provide a personalized experience to that user. On the right-hand side, I have a screenshot of the iPlayer homepage. The second rail here, Continue Watching, shows the list of programs this user has been watching, along with their watching progress.
To put things in context, we receive around 15,000 to 30,000 transactions per second in UAS. Approximately 150 million activities are stored in a day. Approximately 75 user experiences are powered across different products by UAS. These are the results of a survey conducted by Ofcom, showing the top 10 online video services used by UK audiences. As you can see, at the very top we have iPlayer at 25%. In second place we have Prime Video at 18%, and in third, Sky Sports at 11%. You can really see that we do have a lot of people coming to BBC iPlayer.
System Integration and Architecture
Let’s now look at the system integration and the architecture. It is a very simple integration, as you can see, between iPlayer and user activity service, where iPlayer web, TV, and mobile interact with UAS via the iPlayer backend. Diving a little further into the UAS architecture. On the right-hand side, we have clients such as iPlayer, Sounds, and the others. There is a common point of entry into UAS, the gateway.
Depending upon the HTTP verb, the requests are then directed to the appropriate path. UAS is a microservice-based architecture hosted on virtual machines. We make use of topics and queues, and also a NoSQL database. As you might have noticed, we have a synchronous path and an asynchronous path. When the user lands on the iPlayer homepage, a request is sent to read the user’s activities, and that read goes via the synchronous path. However, when a user, for example, watches something on iPlayer, the user’s activities will be sent from iPlayer to UAS via the asynchronous path, and will eventually be persisted into DynamoDB, a NoSQL database.
Steps for Improving System Performance
When we talk about the steps for improving the performance and capacity of our system, it is basically a three-step process. You first need to identify the bottlenecks in the system in order to go and fix them. It’s an iterative process. We first identify the bottlenecks. We then go and fix them. Then, again, you do some testing to see whether fixing those bottlenecks has actually made any impact.
Have they improved the performance of the system? This is something that we have been doing over the years. Finally, we’ll be looking at aligning your data model. As we know, data is central to driving personalization. It’s very important to ensure that your data model is in alignment with the access patterns of the data. Here I’ll be sharing with you the different levers that we have used, and also how we went about aligning our data model with the access patterns.
I did say initially that we have been iterating on making UAS better, or improving the performance of UAS, for quite some time. There are always surprises, though. That surprise came to us in the form of lockdown. This is the time when people were confined to their homes with a lot of time on their hands, and with that extra time, we saw people coming to iPlayer in record numbers. BBC iPlayer’s biggest ever day was the 10th of May, the day of the prime minister’s statement regarding the lockdown.
When we looked at the data, we realized that it was almost 61% higher than the same 7-week period the previous year. What happened on the 10th of May, when people came to us in record numbers? iPlayer crashed. We had a momentary incident on iPlayer. When we looked at the iPlayer backend’s external dependencies, this one shows the interactions with the user activity service. You can see that around 7 p.m., there were a lot of errors coming out of UAS.
Digging a little deeper into UAS, we saw there was a huge spike on our DynamoDB. This dashboard shows the read capacity on one of our tables. The blue data points indicate the consumed capacity. The red line indicates the provisioned capacity, or the maximum threshold available at that particular point in time. As you can see, there’s a big spike around 7 p.m., which resulted in UAS throwing errors, and those errors cascaded to iPlayer, causing it to crash.
The very first technique I will talk about here is simulation exercises. We had an incident on the 10th of May. What we did on the back of that incident is that the iPlayer backend and UAS teams ran a simulation game. These are exercises carried out for the purpose of training, and they’re designed to imitate a real incident. In this simulation exercise we replicated the 10th of May incident. We knew that the bottleneck, or the problem behind that incident, was the capacity available on our DynamoDB.
Once we replicated the incident, we then increased the capacity on our DynamoDB, and we saw the problem resolved and iPlayer up and running. Performance testing is another tool that we use extensively for identifying bottlenecks. It is a testing practice wherein you push some load onto a system and see how the system performs in terms of responsiveness and stability. It helps to investigate and measure stability, reliability, and resource usage. It’s a very effective tool for identifying bottlenecks.
Again, as I was saying, it’s an iterative process. You identify the bottlenecks. You resolve them. You then run the performance test to check whether that bottleneck has indeed been resolved. Another thing to call out: although you might see two or three bottlenecks, it’s always better to make one change at a time, rather than making several changes at once and not knowing which one has boosted performance and which has hurt it.
We’re now going to talk about the different levers that we have tuned on our application and that have helped us. The very first one is a circuit breaker. This is a very common design pattern in microservices. It is based on the concept of an electrical circuit. If you have a service 1 and a service 2, service 1 will keep sending requests to service 2 as long as service 2 responds with successful responses.
If service 2 starts failing and the failure rate reaches the failure threshold configured in service 1, service 1 will break the connection, or open the circuit, to service 2. After some predefined period of time, service 1 will again start sending a few requests to service 2, to check whether service 2 has recovered. If it responds with success, service 1 will then close the circuit and resume normal calls to service 2. I would like to call out that when we had the incident on the 10th of May, we did have a circuit breaker in place in iPlayer.
However, there was a misconfiguration due to which it did not fire. It’s important to have a circuit breaker, but equally important to ensure that it actually works; it’s very important to test it. Now that we have the circuit breaker properly configured, rather than iPlayer crashing when UAS returns errors, we provide a degraded, non-personalized service to our end users.
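To make the pattern concrete, here is a minimal, hand-rolled sketch of that state machine in Java. It is not the actual iPlayer or UAS implementation (in practice you would more likely reach for a library such as Resilience4j), and the threshold and wait values are illustrative only.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Minimal circuit breaker sketch: CLOSED -> OPEN on repeated failures, HALF_OPEN after a cool-off. */
public class SimpleCircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before opening
    private final Duration openDuration;  // how long to stay open before probing again
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    public SimpleCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    public synchronized <T> T call(Supplier<T> request, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openDuration))) {
                state = State.HALF_OPEN;   // cool-off elapsed: allow a probe request
            } else {
                return fallback.get();     // still open: serve the degraded response
            }
        }
        try {
            T result = request.get();
            consecutiveFailures = 0;
            state = State.CLOSED;          // success closes the circuit again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = Instant.now();
            }
            return fallback.get();         // degrade rather than cascade the failure
        }
    }
}
```

A caller could then wrap its UAS read in something like breaker.call(() -> fetchActivities(userId), List::of), so that when UAS is erroring the user still gets a non-personalized page rather than a crash.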
Retry requests. In most of our services, we have retries with exponential backoff, which means retrying with increasing waiting times between the retries. A retry is suitable for requests that failed due to, for example, a network glitch. It’s important to set a threshold for the number of retries, because you do not want to retry indefinitely. It’s also important to distinguish between retriable errors and non-retriable errors.
An example of a retriable error would be a network glitch. A non-retriable error would be something like trying to access something for which a user does not have permissions, or sending a request with an incorrect payload.
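A minimal sketch of that policy, assuming for illustration a maximum of three attempts and a base delay of 100 milliseconds (the real services' settings are not given in the talk); only errors classified as retriable are retried.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

public class RetryWithBackoff {
    /**
     * Retries the supplier up to maxAttempts times, doubling the wait between attempts,
     * but only when the failure is classified as retriable (e.g. a network glitch).
     */
    public static <T> T retry(Supplier<T> request,
                              Predicate<Exception> isRetriable,
                              int maxAttempts,
                              long baseDelayMillis) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.get();
            } catch (Exception e) {
                last = e;
                if (!isRetriable.test(e) || attempt == maxAttempts) {
                    throw e;  // non-retriable, or out of attempts: fail fast
                }
                long delay = baseDelayMillis * (1L << (attempt - 1));  // 100ms, 200ms, 400ms...
                Thread.sleep(delay);  // in production you might also add jitter
            }
        }
        throw last;  // unreachable, kept for the compiler
    }
}
```

It might be called as retry(() -> uasClient.fetch(userId), e -> e instanceof java.io.UncheckedIOException, 3, 100), where uasClient is a hypothetical client: an I/O failure is treated as retriable, while, say, a permissions error is not.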
Batch APIs. This is a nice figure to explain the concept of batch APIs. Rather than sending single, separate requests to the service, with the service then processing them separately and sending you separate responses, we found that it is much more performant to batch those requests together and send them to the service. The service will then process them as a batch and provide you with a batch response. Again, it depends upon your use case whether it makes sense to have batch APIs.
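As a sketch of the idea (the real UAS endpoint shapes are not shown in the talk), the difference is essentially one round trip per item versus one round trip for many items:

```java
import java.util.List;
import java.util.Map;

/** Illustrative only: one round trip per activity vs. one round trip for many. */
public interface ActivityClient {
    // N items => N requests, N responses
    Activity getActivity(String userId, String activityId);

    // N items => 1 request, 1 response; the service processes the batch internally
    Map<String, Activity> getActivities(String userId, List<String> activityIds);

    record Activity(String id, String product, long progressSeconds) {}
}
```

The saving comes from amortizing connection, authentication, and serialization overhead across the batch, at the cost of slightly more complex error handling, since one item in a batch can fail while the others succeed.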
Web server tuning. Of course, what levers are available depends upon the web server that you’re using. In most cases, upgrading to the next stable version is bound to give you a performance boost. We make use of Apache Tomcat, and when we upgraded to the next stable version, it definitely gave us a performance boost. The other thing we changed was the number of threads available in Tomcat. Once we had that optimal setting in place, it definitely boosted our performance. You might want to check what levers are available for the web server you’re making use of.
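The exact numbers are workload-specific, but to illustrate the kind of lever involved: if you run embedded Tomcat, the connector's thread pool can be set programmatically, and in a standard install the same maxThreads setting lives on the Connector element in server.xml. The values below are placeholders, not the BBC's settings.

```java
import org.apache.catalina.connector.Connector;
import org.apache.catalina.startup.Tomcat;

public class TunedTomcat {
    public static void main(String[] args) throws Exception {
        Tomcat tomcat = new Tomcat();
        tomcat.setPort(8080);

        // Tune the request-processing thread pool on the default connector.
        Connector connector = tomcat.getConnector();
        connector.setProperty("maxThreads", "300");      // upper bound on worker threads
        connector.setProperty("minSpareThreads", "25");  // threads kept warm for spikes
        connector.setProperty("acceptCount", "200");     // queue length when all threads are busy

        tomcat.start();
        tomcat.getServer().await();
    }
}
```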
Token validation. This one might be a little subjective. To put things in perspective, when iPlayer makes a request to UAS, we receive a token as part of that request. The token is used to validate that it is a valid user request. Historically, UAS made use of online token validation. When we got a request from iPlayer, the token was sent to another service, owned by another team, which we’ll call the authentication service. The token was validated by the authentication service, the result was passed back to UAS, and then the flow carried on.
After a lot of deliberation, and of course doing threat modeling and checking the security implications, we decided to move this token validation in-house into UAS, now called offline token validation, where the token is validated and the user authenticated inside UAS. That gives us less coupling and increased reliability, because we are no longer dependent on the other authentication service, and lower latency, because there’s one less hop. It’s definitely worth calling out that making this change requires you to do threat modeling and consider the security implications.
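For illustration only, since the talk does not describe the BBC token format: assuming a signed JWT and a public key distributed to UAS, offline validation amounts to verifying the signature and the expiry locally instead of calling out to the authentication service. This sketch uses the Nimbus JOSE+JWT library.

```java
import com.nimbusds.jose.crypto.RSASSAVerifier;
import com.nimbusds.jwt.SignedJWT;

import java.security.interfaces.RSAPublicKey;
import java.util.Date;

public class OfflineTokenValidator {
    private final RSAPublicKey signingKey;  // fetched and rotated out of band from the auth team

    public OfflineTokenValidator(RSAPublicKey signingKey) {
        this.signingKey = signingKey;
    }

    /** Returns the user id if the token is genuine and unexpired; throws otherwise. */
    public String validate(String token) throws Exception {
        SignedJWT jwt = SignedJWT.parse(token);

        // Signature check: proves the token was issued by the authentication service.
        if (!jwt.verify(new RSASSAVerifier(signingKey))) {
            throw new SecurityException("Invalid token signature");
        }

        // Expiry check: neither check needs an extra network hop.
        Date expiry = jwt.getJWTClaimsSet().getExpirationTime();
        if (expiry == null || expiry.before(new Date())) {
            throw new SecurityException("Token expired");
        }
        return jwt.getJWTClaimsSet().getSubject();
    }
}
```

Key distribution and rotation become your responsibility once validation moves in-house, which is exactly why the threat modeling mentioned above matters.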
Load balancer migration. Again, you need to check what levers or configurations are available for you to tune on your load balancer. UAS is hosted in AWS. We started off with a classic load balancer, which was good for handling traffic that grew gradually or organically, but it wasn’t great at handling spiky traffic. Once we migrated to the application load balancer and the network load balancer, it gave us the capacity to handle spiky traffic really well.
General compute. If your performance tests highlight a bottleneck in your instances, then the decision is whether you want to scale horizontally or vertically. Horizontal scaling means increasing the number of instances in your fleet; vertical scaling means moving to the next bigger instance type. We also migrated to new generation instances, and we found that this migration was not only more performant but also more cost effective.
Reviewing your autoscaling policies. Autoscaling is a process that constantly monitors your application and automatically readjusts its capacity to provide the stability required to be available and reliable. In an autoscaling policy, you have criteria based on a metric. You might want to review which metric makes sense in your case. If you make use of a metric like CPU utilization, then you have different thresholds to consider.
For example, what is the value of the CPU utilization threshold? If it’s at 60% and you want more aggressive scaling, you might want to reduce it to, say, 50%. What is the evaluation period, that is, how long does the autoscaling criteria need to be satisfied? Or you could be more aggressive in the number of instances added when the criteria are satisfied: do you just want one, or do you want to add two or three instances, depending upon the criticality of your application?
Another thing that we make use of is step scaling. In our application, we know what the peak periods and the non-peak periods are. During the peak periods, we have a cron job, or step scaling as we call it, which increases the number of instances available for that peak period. During non-peak periods, such as night time, we scale the number of instances back down.
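As a sketch of what such a scheduled, cron-driven scale-up can look like on AWS (the group name, times, and instance counts below are made up for illustration):

```java
import software.amazon.awssdk.services.autoscaling.AutoScalingClient;
import software.amazon.awssdk.services.autoscaling.model.PutScheduledUpdateGroupActionRequest;

public class PeakTimeScaling {
    public static void main(String[] args) {
        try (AutoScalingClient autoScaling = AutoScalingClient.create()) {
            // Scale up ahead of the evening peak...
            autoScaling.putScheduledUpdateGroupAction(PutScheduledUpdateGroupActionRequest.builder()
                    .autoScalingGroupName("uas-read-api")  // hypothetical group name
                    .scheduledActionName("evening-peak-up")
                    .recurrence("0 17 * * *")              // every day at 17:00 UTC
                    .minSize(10).maxSize(30).desiredCapacity(20)
                    .build());

            // ...and back down overnight when traffic is low.
            autoScaling.putScheduledUpdateGroupAction(PutScheduledUpdateGroupActionRequest.builder()
                    .autoScalingGroupName("uas-read-api")
                    .scheduledActionName("overnight-down")
                    .recurrence("0 1 * * *")               // every day at 01:00 UTC
                    .minSize(4).maxSize(30).desiredCapacity(4)
                    .build());
        }
    }
}
```

Demand-based policies still handle the variation within the day; the schedule simply guarantees a sensible floor during known peak periods.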
Client timeouts. Again, if you have a service 1 and a service 2, service 1 sends a request to service 2, and the timeout is the amount of time that service 1 is allowed to wait for a response from service 2. Let’s say service 1 has a timeout of 200 milliseconds. Once it sends a request to service 2, if service 2 does not respond within that 200-millisecond period, service 1 will break its connection to service 2. This ensures that the core services keep working even when dependent services are not available; it prevents services from waiting indefinitely and blocking threads.
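A minimal sketch of that 200-millisecond budget using the JDK's built-in HTTP client; the URL and the empty-list fallback are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;
import java.util.List;

public class ActivitiesClient {
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofMillis(100))  // give up quickly on connection setup
            .build();

    public List<String> fetchActivities(String userId) {
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://uas.example.internal/activities/" + userId))  // placeholder URL
                .timeout(Duration.ofMillis(200))     // total time we are willing to wait for a response
                .GET()
                .build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            return parse(response.body());
        } catch (HttpTimeoutException e) {
            return List.of();                        // degrade: non-personalized view instead of a blocked thread
        } catch (Exception e) {
            return List.of();
        }
    }

    private List<String> parse(String body) {
        // Parsing elided; illustration only.
        return List.of(body);
    }
}
```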
With this, we have come to the end of the different levers, and now we’re going to look at aligning your data model. Once we had all the levers that I’ve spoken about in place, and ran the performance tests on UAS again, we identified that the bottleneck was now our DynamoDB. What we realized is that when we had a thundering herd event on our DynamoDB, it was not scaling fast enough. Although we have autoscaling set up on our DynamoDB, it was not scaling up fast enough to handle the incoming traffic, spikes that grow within a couple of seconds.
The very first thing we did was look at the pricing model, or the different capacity modes, available on DynamoDB. In UAS, we make use of provisioned mode, and that works very well for most of our use cases except spiky traffic. We then looked at on-demand mode, which is meant to be really good for spiky traffic. However, on-demand can be almost seven times the cost of provisioned mode.
At the BBC, we have to operate within the limits of a restricted budget. We have to be cost effective. Hence, we decided to stay on provisioned mode and then check what our options might be within it. That’s when we started looking into indexing. We went back to the drawing board and started to understand: what are the use cases of UAS?
What is the data being used for? Is it being retrieved efficiently? We started talking to our stakeholders. That’s when we realized that the data model was not aligned with the access patterns of the data. The answer we found was in the form of indexing. As in a relational database, an index basically gives you better retrieval speed, or makes your retrieval more efficient.
Before we went to the new index, we were retrieving information based on the user ID and the last modified time. In NoSQL, you can retrieve information efficiently only based on the key attributes. You can, of course, retrieve based on non-key attributes, but you would have to scan the entire table to get the data that you want, which is a very expensive operation. The most efficient way of retrieving information is to base your retrieval, or query, on the key attributes.
Previously, we were querying the data based on the user ID and the last modified time, but when iPlayer wanted the user activities that belonged to iPlayer, what we were doing was retrieving the activities for all the products, whether iPlayer, Sounds, or any of the other BBC products that send user activities. Once the data was fetched, we were then filtering it down to iPlayer and returning that to iPlayer.
What we decided to do was make the product domain a part of our key attribute. Now, when iPlayer sends a request asking for user activities, we retrieve only the information associated with iPlayer. This reduces the amount of data fetched, and it is fetched only for the product that requested it. It also prevents a thundering herd event on our DynamoDB. Less data implies less capacity and hence less time taken, which improves the availability and reliability of our system. Of course, less capacity also implies cost savings, which is a win-win.
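As a sketch of what that query shape can look like with the AWS SDK for Java: the table name, attribute names, and key layout here are illustrative, not the actual UAS schema. The point is that the product domain is part of the partition key, so the query touches only that product's items.

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

import java.util.Map;

public class ActivityQueries {
    private final DynamoDbClient dynamoDb = DynamoDbClient.create();

    /** Fetches only the requesting product's activities, newest first. */
    public QueryResponse latestActivities(String userId, String product, int limit) {
        QueryRequest request = QueryRequest.builder()
                .tableName("user-activity")                     // hypothetical table name
                .keyConditionExpression("userIdProduct = :pk")  // e.g. "<userId>#iplayer"
                .expressionAttributeValues(Map.of(
                        ":pk", AttributeValue.builder().s(userId + "#" + product).build()))
                .scanIndexForward(false)                        // sort key = last-modified time, descending
                .limit(limit)
                .build();
        return dynamoDb.query(request);
    }
}
```

Before the change, the equivalent query keyed only on the user, fetched every product's activities, and filtered afterwards, so each iPlayer page view paid for data it never used.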
Of course, there are always challenges. These challenges come in the form of high-profile events like Wimbledon, elections, the World Cup, breaking news, and, of course, the state funeral last year. These are events that drive a lot of users to the BBC. Most often, for a high-profile event like this, you will get a huge spike in traffic with people coming to iPlayer, and hence requests coming to UAS to retrieve those users’ activities.
If you’re talking about an event like a World Cup match, we would get a spike right at, or around, the beginning of the match, and whenever there is a goal you’d see spikes at those times too. Some of the levers for high-profile events would be pre-scaling your components, because of the type of spikes that you get: the spikes grow within seconds, and your autoscaling criteria might not be satisfied in time to scale your application.
You might want to pre-scale ahead of the event, or you might want to set some scheduled actions on your DynamoDB to increase the capacity for the duration of the event. Of course, monitor your application and make the necessary changes. One of the biggest high-profile events last year was the state funeral following the Queen’s death. We got around 28,000 requests per second at UAS around 11:57. If you look at the response time, it was around 20 milliseconds. We were really happy with that.
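One simple way to pre-scale provisioned capacity ahead of a known event is a one-off capacity bump like the sketch below; the numbers and table name are illustrative, and if autoscaling manages the table you would raise the autoscaling minimum for the event window as well, so it does not immediately scale you back down.

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.ProvisionedThroughput;
import software.amazon.awssdk.services.dynamodb.model.UpdateTableRequest;

public class EventPreScaler {
    public static void main(String[] args) {
        try (DynamoDbClient dynamoDb = DynamoDbClient.create()) {
            dynamoDb.updateTable(UpdateTableRequest.builder()
                    .tableName("user-activity")          // hypothetical table name
                    .provisionedThroughput(ProvisionedThroughput.builder()
                            .readCapacityUnits(20_000L)  // illustrative event-day headroom
                            .writeCapacityUnits(5_000L)
                            .build())
                    .build());
        }
    }
}
```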
Tooling
Now, on to the tooling that has enabled and helped us through this journey. First of all, we have the CI/CD pipeline, the continuous integration and continuous deployment pipeline. It’s a very good practice to treat the infrastructure as code. Previously, we had a problem where people would manually copy or duplicate an existing pipeline and then make changes by hand. There was always a risk of some misconfiguration, and a lot of time was lost getting it working again.
Now that we are treating the infrastructure as code, if there’s any problem with the pipeline, we just redeploy it, and everything is as good as new. Resilient one-click deployments: you want to reduce the time between code build and deployment. It’s an iterative process. You identify a bottleneck, you fix it, you deploy it, you then run the performance test. You want to be as quick as possible in that cycle. You do not want delays of days or weeks. You want genuinely one-click deployments. Of course, more frequent deployments of smaller changes, one change at a time.
Performance testing. There are different types of load tests that we perform. A benchmarking load test: first of all, identify the capacity of your application without making any changes. A BAU load test, or business-as-usual load test: that’s your normal traffic pattern, so identify how your system performs under normal traffic. A spike load test: it’s good to know what level of spike you can handle on top of your BAU load.
A soak test: running the load test for a couple of hours to see whether any of your services become stressed or strained over that period. Of course, it’s really good to know the breaking limits of your system, the uppermost threshold that your system can handle. This information is really vital, especially for high-profile events, so that you know your capacity and can make those pre-scaling decisions.
Another big improvement we made was creating real-time load test reports. Previously, we used to run a load test, wait for the entire test to finish, and only then have the reports generated. Now we’ve managed to create real-time reports, so that we can see the health and performance of our system as the load is being pushed.
Monitoring dashboards and notifications. I think this has been an absolute game changer for us. Of course, we did have monitoring dashboards previously, but we have since increased their granularity and detail. If somebody asks us how UAS is doing, or whether UAS is having any issues, we can go to these dashboards, drill down, and see whether there are issues and where they are.
It really helps to understand the health of your system and to track your performance metrics. In addition to the dashboards, we also have alarms and notifications in the form of emails and Slack messages. If these alarms fire, we have well-defined actions documented in our runbook. There’s no time lost trying to work out the next step; we know exactly what needs to be done, so the fix is much quicker.
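To give a flavour of the kind of alarm that sits behind those notifications, here is a sketch using CloudWatch; the metric, namespace, threshold, and SNS topic are all placeholders rather than the real UAS configuration.

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.ComparisonOperator;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricAlarmRequest;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

public class UasAlarms {
    public static void main(String[] args) {
        try (CloudWatchClient cloudWatch = CloudWatchClient.create()) {
            cloudWatch.putMetricAlarm(PutMetricAlarmRequest.builder()
                    .alarmName("uas-read-5xx-rate")   // hypothetical alarm
                    .namespace("UAS")                 // hypothetical custom namespace
                    .metricName("Http5xxCount")
                    .statistic(Statistic.SUM)
                    .period(60)                       // 1-minute buckets
                    .evaluationPeriods(3)             // must breach for 3 consecutive minutes
                    .threshold(100.0)
                    .comparisonOperator(ComparisonOperator.GREATER_THAN_THRESHOLD)
                    .alarmActions("arn:aws:sns:eu-west-1:123456789012:uas-oncall")  // e.g. SNS -> email/Slack
                    .build());
        }
    }
}
```

Each alarm then maps to an entry in the runbook, which is what makes those well-defined actions possible.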
Managing Stakeholder Expectations
Let’s look at managing stakeholder expectations. You need to have a good understanding of your complete use case, the user journeys from start to finish. In big organizations, there are often so many teams, and these teams tend to operate in silos. You tend to work only within the bounds of your team, and hence a lot of information and context is lost. Always try to understand the user journey right from start to finish, even if there are lots of teams involved.
Have open channels of communication between the teams. Understand the traffic profile. Know what the critical path is and what the non-critical path is; this will help you in using patterns like retries, circuit breakers, and the others. Some tools for collaboration and communication that we use are Slack, which is really real time. We have product help channels, for example help UAS, where we have all the stakeholders of UAS, and they talk to us, ask us questions, and get in touch with us if they see any issues with our service. We’re trying to increase these communications with our stakeholders. We also use Dropbox for asynchronous communication.
Foster a culture of experimentation, be it in the form of simulation exercises, as we have seen before, or fire drills. A fire drill could even be a paper exercise: for example, listing what will happen in the case that your database goes down. What is the disaster recovery process? You can have different teams involved. You can have a well-defined process in place, with well-defined actions and ownership. If the situation arises, there is no pressure, and you know the set of actions to be performed, and in what sequence.
We also have good forums for knowledge transfer and collaboration. We have a monthly session between UAS and our stakeholders, called data and personalization, wherein we talk about UAS as a product. We talk about the different endpoints. We talk about our roadmap. We give our stakeholders a chance to come and ask us questions, and to talk to us about their grievances. It’s also a good place for stakeholders to talk to each other and share best practices: what’s working, what’s not working, or how they got around a particular problem. It’s really important to have those channels to transfer knowledge and talk to each other.
Some other considerations are worth pondering. Because we have data and databases, think about data retention policies. How much value is there in retaining data older than 10 years? Forecast your future demand, not just short-term but midterm and long-term as well. Review your architecture and infrastructure, and understand whether they are sustainable and can handle that forecasted traffic.
Cultural change, incident washups, and root cause analysis are really important, along with actions on the back of that root cause analysis to fix the problem so that it does not recur. Of course, a no-blame culture: you don’t want to be pointing fingers, even if there’s a failure. What you want to do is make things better. You want people to communicate, and you want them to have a safe space for discussion. If you point fingers, then it becomes very restrictive, and people become reserved. Of course, have a well-defined disaster recovery process in place.
Recap
We looked at steps for improving system performance. We looked at how to identify performance bottlenecks. We looked at some of the levers for tuning your application. An important thing, finally, is to make sure your data model is aligned with the access patterns of the data. We looked at some of the enabling tooling: the CI/CD pipeline, performance tests, and monitoring dashboards. Finally, we looked at some ways to manage stakeholder expectations.
Summary
Understand the complete user journey to know the implications of a change. Break the silos between teams. Open the channels of communication and collaboration. Know what your stakeholders want. Do not make any assumptions. Technology without communication will solve only one-half of the problem. It has to go hand-in-hand. Review, reform, reflect. Look at your architecture, look at your infrastructure, identify the bottlenecks, make the change. Then, again, see whether it has made any impact. This has to be an ongoing process, an iterative process.