MMS • Anderson Parra Vitor Pellegrino
Pellegrino: We’re going to talk about how we at SeatGeek handle high-demand ticket on-sales, and how we do that successfully while preserving the customer experience throughout the process.
Parra: I’m Anderson. I work in the app platform team at SeatGeek as a senior software engineer. I’m the tech lead for the virtual waiting room solution that was made at SeatGeek.
Pellegrino: My name is Vitor Pellegrino. I’m a Director of Engineering, and I run the Cloud Platform teams, which are the teams that support all the infrastructure, and all the tooling that other developers use at the company.
Let me talk a little bit about what SeatGeek is. SeatGeek works in the ticketing space. If you buy tickets, sell tickets, this is pretty much the area we operate in. We try to do that by really focusing on the better ticketing experience. It’s important for us to think about the experience of the customer, whoever is selling tickets as well. This is our main differentiator. Maybe you will recognize some of these icons of these logos here, so we’ve been very fortunate to have partnerships with some of the leading teams in the world, not only for the typical American sports, but also if you’re a soccer fan, and maybe you know the English Premier League, also clubs like Liverpool or Manchester City, they work with us. If you buy tickets for them, you use SeatGeek software.
The Ticketing Problem
Let’s talk about what actually the ticketing problem is, and why we felt like this could be something interesting. When we think about ticketing, if you’re like me, most of your experience is actually trying to find maybe a concert to attend, or maybe you would like to actually buy tickets to your favorite sports team, or maybe a concert, or what have you. You’re actually trying to see what’s available, you’re going to try to buy a ticket. Then after you’ve bought the tickets, you’re eagerly waiting for the day to come for the match or whatever event that you have, and then you’re going to enter a stadium. This is what we would call the consumer aspect, or the consumer persona of our customer base. Maybe I own a venue, actually, I represent one of the big sports clubs, or I actually have a venue where we do big concerts. In that case, my interests are a little bit different. I want to sell as much inventory as quickly as possible. Not only that I want to sell tickets, but I also would like to know how my venue is. Which ones were the most successful? Which ones should I be trying to repeat? Which one sold out as quickly as possible? Also, I would like to manage even how people get inside the stadium. After I sold tickets, how do I actually allow people to get inside a venue safely at a specific time, without any problems?
Also, a different thing that we handle at SeatGeek is the software that we build, they run in different interfaces as well, which means that we have the folks that are buying tickets on mobile. We have the scanners that allow people to get inside a venue. Also, when they get inside a venue itself, they might have to interact with systems with different types of interfaces, different reliability and resilience characteristics. This is just one example of one of these physical things that run inside a stadium, like at the height of the pandemic, we were able to design solutions for folks to buy merchandise without leaving their seats. That’s just an example of software that needs to adapt to these different conditions. It’s not only a webpage where people go, there are a lot of things to it.
About the characteristics of ticketing itself, so in a stadium, there is a limited amount of space. It may be very large stadiums, but they all are at capacity at some point, which means that it’s very often especially for the big artists, or something, like the big events, usually you have far more demand than you have inventory for. Then what happens if many people are trying to get access to the same seats, how do you actually tiebreak that? There are concurrency issues, like maybe two people are trying to reserve at the same time, how do you tiebreak? Maybe you need to also think about the experience, like it is whoever saw the space first, it is whoever was able to actually go through the payment processor first. Who actually gets access to the space? This is something very characteristic of this problem.
Normal Operations vs. On-Sales
Most of the time we are in what we call normal operations. People are just browsing the website trying to find things to attend. Coming back to the example that I brought, I just want to see what’s available. To me, it’s very important to quickly get access to what I want, and have that seamless browsing experience. Then there is what we define as an on-sale. This is something that has a significant marketing push behind it, and there is usually also something happening in the outside world that impacts the systems that we run. Imagine that I’m Liverpool, and I’m about to release all the season tickets at a specific time on a specific day. People are going to SeatGeek at that specific time trying to get access to a limited amount of inventory. These are very different modes that our systems need to understand. Let me show you what that looks like. The baseline is relatively stable. During an on-sale, you have far more traffic, sometimes many orders of magnitude more. If you’re like me before I joined this industry, you might think, what’s the big deal? We have autoscaling now. We tweak the autoscaling, and that should just do it. It turns out, that’s not enough. Most of the time autoscaling simply cannot scale fast enough. By the time you actually start to see all this extra traffic, the autoscaling takes too long to recognize that and add more capacity. Then we just have a bad customer experience for everybody. That’s not what we want to do.
Here is another example of things that we have to think about during an on-sale. The tradeoffs, like the non-functional requirements, might also change during an on-sale. Latency is one example. In normal operations, I want my website to feel really snappy. I want to quickly get access to whatever is happening. During an on-sale, I might actually trade latency for more redundancy. Maybe during an on-sale I’m going to have different code paths to guarantee that no request will fail. A request that fails with a 500 is never a good experience, but it’s far more tolerable if that happens outside of a very stressful situation where I have waited in line, perhaps for hours, to get access. This is something very important.
Even security. Security is always forefront on everything that we do, but detecting a fraud might change during an on-sale. One example. Let’s say that I’m trying to buy one ticket, but maybe I’m trying to buy a ticket not just for me, but for all my friends. I actually want to buy 10 tickets, or maybe 20. I have a huge group. How much is tolerable? Is one ticket ok? Is 10 tickets ok? Is 20 tickets ok? We also see that people’s behaviors change during an on-sale. If I actually open a lot of different browsers and a lot of different tabs, and I somehow believe that that’s going to get me a better chance to get access to my tickets during an on-sale, is that fraudulent or not? There is no easy answer to these kinds of things. This is something that may change depending on whether you are doing an on-sale or not. The main point here being that you must design for each mode of operation very differently. The decisions that you take for one might be different from the decisions you take for the other one.
Virtual Waiting Room
Parra: We’re going to talk about the virtual waiting room, a solution built inside SeatGeek that we call Room. It’s a queuing system. Let’s see the details of that. Before we talk about the virtual waiting room, let’s create some context about the problem where the virtual waiting room acts. Imagine that you’d like to purchase a ticket for an event in high demand, where a lot of people are trying to buy tickets as well. The on-sale starts at 11:00, but you arrive a little bit earlier. At that point we are in a mode where we’re protecting the ticketing page, and you are in the waiting room, which means that you are waiting for the on-sale to start. Then at 11:00 the on-sale starts, you are placed in the queue, and you wait for your turn to go to the ticketing page and try to purchase the tickets. We know that queues are bad. Queues are bad in the real world and queues are bad in the virtual world as well. What we try to do is drain this queue as fast as possible.
Why Do We Need A Queuing System?
Why do we need a queuing system? We talked about how an autoscaling policy is not enough. Also, there are some characteristics of our business that require a queuing system. For example, we need to guarantee a fair approach to purchasing tickets. In online ticketing, the idea is first-in, first-out: whoever arrived earlier gets the chance to try to purchase tickets earlier. There is also the way that you control the operation. You cannot send everyone who is waiting to the ticketing page at once, because users first reserve tickets and then finish the purchase. While finishing the purchase, a lot of problems could happen. For example, a credit card could be denied, or you realize that the ticket is too expensive and you give up, and then the ticket becomes available again for another user who now has the opportunity to purchase it. It means that we need to send the waiting users to the ticketing page in batches. Also, we need to avoid problems during the on-sale. We talked about the different modes we operate in, normal and on-sale. The on-sale mode is our critical time: everybody is looking at us and we’re trying to keep our systems up and running. Controlling the traffic is our opportunity to avoid problems during high-demand traffic.
The Virtual Waiting Room Mission
Then, what’s the mission of the virtual waiting room? The mission of the virtual waiting room is to absorb this high traffic and pipe it to our infrastructure at a constant rate. The good part about constant traffic is that we can plan for it: we can execute load tests, for example, and analyze how many requests per second our systems support. Then, based on this information, we can absorb that spike of high traffic and pipe it at a constant rate to our infrastructure in order to keep our systems up and running.
Considerations When Building a Queuing System
Here are some considerations when you’re building a queuing system. First, stateless or stateful. In the ideal world, you control the traffic at the edge, in the CDN, and avoid requests going to the backend that you don’t need to handle at that time. However, there is no state in the CDN. No state means that you cannot guarantee the order, and order matters for us. We need to know who arrived earlier so we can drain those people from the queue earlier. In a stateful design, the state is traditionally controlled in the backend, so we need to manage the state of the queue in the backend. If you have a queue, then you need to talk about how to drain it. It could be, for example, random selection: you select a few users from the queue, remove them, and send them to the ticketing page.
Again, first-in, first-out is the fair approach for our business, so we chose first-in, first-out. Also, we need to provide some actions that operators can take during on-sales. For example, they can increase the exit rate of the queue, or they can pause the queue because there is a problem in one component of our system. They also need to communicate with the audience that is waiting. For example, if the event is sold out, you want to broadcast that information as fast as possible, so people don’t wait for something that they cannot get. Another important thing is metrics. We need to get metrics from all the components to analyze what’s going on. We need to understand the behavior of our system in terms of how long the queue is, or how much time a user spent purchasing a ticket, and how our components behaved during the on-sales. This is important for making decisions to improve our systems.
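To make the first-in, first-out draining and the operator actions concrete, here is a minimal in-memory sketch in Go. It is an illustration only and not SeatGeek's implementation: the `Queue` type and its method names are hypothetical, and a real queue would keep its state in a datastore rather than in a slice.

```go
package main

import "fmt"

// Queue is a hypothetical in-memory FIFO waiting line with operator controls.
type Queue struct {
	visitors []string // visitor IDs in arrival order
	exitRate int      // visitors released per drain tick
	paused   bool     // set by an operator when a downstream component has a problem
}

func NewQueue(exitRate int) *Queue { return &Queue{exitRate: exitRate} }

// Join appends a visitor to the back of the line.
func (q *Queue) Join(id string) { q.visitors = append(q.visitors, id) }

// Operator actions described in the talk: change the exit rate, pause, resume.
func (q *Queue) SetExitRate(n int) { q.exitRate = n }
func (q *Queue) Pause()            { q.paused = true }
func (q *Queue) Resume()           { q.paused = false }

// Len reports how many visitors are still waiting.
func (q *Queue) Len() int { return len(q.visitors) }

// Drain releases up to exitRate visitors, oldest first.
// It releases nobody while the queue is paused.
func (q *Queue) Drain() []string {
	if q.paused || q.exitRate <= 0 || len(q.visitors) == 0 {
		return nil
	}
	n := q.exitRate
	if n > len(q.visitors) {
		n = len(q.visitors)
	}
	released := append([]string(nil), q.visitors[:n]...)
	q.visitors = q.visitors[n:]
	return released
}

func main() {
	q := NewQueue(2)
	for _, id := range []string{"ana", "bo", "cy", "di", "ed"} {
		q.Join(id)
	}
	fmt.Println(q.Drain()) // the two earliest arrivals leave first
	q.Pause()
	fmt.Println(q.Drain()) // nothing leaves while paused
	q.Resume()
	q.SetExitRate(3)
	fmt.Println(q.Drain(), q.Len())
}
```

Each call to `Drain` stands in for one tick of whatever periodic job drains the queue; random selection would only change which elements `Drain` picks.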
Stateless or Stateful?
Virtual Waiting Room Tech Stack
How Does It Work?
We’re going to see in detail how it works. The virtual waiting room operates in two modes. There is the waiting room mode, where the on-sale hasn’t started yet and it blocks all the traffic that goes to the protected zone, and the queuing mode, where the on-sale has started, we have a queue, and we are draining it. Both modes apply to a protected zone. What is a protected zone in the end? It’s a simple path. Usually, it’s a ticketing page, where everybody goes to try to purchase the tickets. A queue is formed for each protected zone, and in production we have over 2,000 protected zones running now. Each protected zone has some attributes. There is the state, which could be blockade, throttle, done, or a state for when we’re still creating or designing a new protected zone. There are the paths, the resources that we’re protecting. It could be 1 or 10, depending on how the event was created, for example whether it has many dates. There are the details of the event, and the limits for the exit rate. To get into the protected zone, a user needs an access token. With the access token, requests are routed to the protected zone.
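The attributes listed above could be modeled roughly as follows. This is a sketch under assumptions: the field names, the `draft` state name, exact path matching, and the rule that zones in `draft` or `done` are not enforced are all hypothetical choices made for the example, not details confirmed by the talk.

```go
package main

import "fmt"

// ZoneState mirrors the states named in the talk. "draft" is a made-up name
// for the state in which a zone is still being created or designed.
type ZoneState string

const (
	StateDraft    ZoneState = "draft"
	StateBlockade ZoneState = "blockade" // on-sale not started: all traffic waits
	StateThrottle ZoneState = "throttle" // on-sale running: the queue is drained
	StateDone     ZoneState = "done"
)

// ProtectedZone is a hypothetical record for one protected zone.
type ProtectedZone struct {
	Name              string
	State             ZoneState
	Paths             []string // one or more protected paths, e.g. one per event date
	ExitRatePerMinute int      // limit on how fast visitors leave the queue
}

// Protects reports whether the zone covers the path (exact match, by assumption).
func (z ProtectedZone) Protects(path string) bool {
	for _, p := range z.Paths {
		if p == path {
			return true
		}
	}
	return false
}

// Admit decides whether a request may reach the target page.
func (z ProtectedZone) Admit(path string, hasAccessToken bool) bool {
	if !z.Protects(path) {
		return true // not a protected path
	}
	switch z.State {
	case StateBlockade:
		return false // waiting room mode: nobody gets through yet
	case StateThrottle:
		return hasAccessToken // queuing mode: only token holders get through
	default:
		return true // assumption: draft and done zones are not enforced
	}
}

func main() {
	zone := ProtectedZone{
		Name:              "example-on-sale",
		State:             StateBlockade,
		Paths:             []string{"/event/123/tickets"},
		ExitRatePerMinute: 500,
	}
	fmt.Println(zone.Admit("/event/123/tickets", false))
	zone.State = StateThrottle
	fmt.Println(zone.Admit("/event/123/tickets", true))
	fmt.Println(zone.Admit("/about", false))
}
```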
The Virtual Waiting Room Main States
Then we have an exchanger function that runs periodically to drain the queue. It basically exchanges visitor tokens for access tokens. We fetch all the visitor tokens that were registered and don’t have an access token yet, and we exchange them. When we update the database, DynamoDB, the change is streamed to a DynamoDB stream. Then we have a function that consumes that notification from the stream and notifies the user. We have the WebSocket open and we take advantage of that to send back the access token. It means that the user doesn’t ask for the access token; it’s a reactive system. When we identify that you are ready to go to the protected zone because your access token was created, we notify you. Then the visitor token is replaced. There is no visitor token anymore. The user has only an access token, and with that access token they can get into the protected zone. The page is refreshed and the user sends that access token. Then the access token is validated without any call to the backend. Security is important. We try to make sure that real users are the ones trying to purchase the tickets, and then the user can go to the protected zone.
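One way an access token can be validated "without any call to the backend" is to sign it, so that any component holding the key can check it locally. The talk does not say how Room's tokens are built, so the following is only a sketch under that assumption: the token layout (`zone|visitor|expiry|signature`), the HMAC-SHA256 signature, and all function names are hypothetical. Times are passed in as Unix seconds to keep the example deterministic.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// sign returns a hex-encoded HMAC-SHA256 of the payload.
func sign(secret []byte, payload string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(payload))
	return hex.EncodeToString(mac.Sum(nil))
}

// issueAccessToken builds "zone|visitor|expiry|signature" (hypothetical layout).
// Zone and visitor IDs are assumed not to contain "|".
func issueAccessToken(secret []byte, zone, visitor string, expiresUnix int64) string {
	payload := zone + "|" + visitor + "|" + strconv.FormatInt(expiresUnix, 10)
	return payload + "|" + sign(secret, payload)
}

// validateAccessToken checks signature, zone, and expiry with no datastore lookup.
func validateAccessToken(secret []byte, token, zone string, nowUnix int64) bool {
	i := strings.LastIndex(token, "|")
	if i < 0 {
		return false
	}
	payload, sig := token[:i], token[i+1:]
	if !hmac.Equal([]byte(sig), []byte(sign(secret, payload))) {
		return false
	}
	parts := strings.Split(payload, "|")
	if len(parts) != 3 || parts[0] != zone {
		return false
	}
	exp, err := strconv.ParseInt(parts[2], 10, 64)
	return err == nil && nowUnix < exp
}

// exchange turns the oldest waiting visitor tokens into access tokens, up to
// limit, and returns the visitors that are still waiting.
func exchange(secret []byte, zone string, waiting []string, limit int, nowUnix, ttlSeconds int64) (map[string]string, []string) {
	if limit > len(waiting) {
		limit = len(waiting)
	}
	if limit < 0 {
		limit = 0
	}
	issued := make(map[string]string, limit)
	for _, visitor := range waiting[:limit] {
		issued[visitor] = issueAccessToken(secret, zone, visitor, nowUnix+ttlSeconds)
	}
	return issued, waiting[limit:]
}

func main() {
	secret := []byte("demo-secret")
	now := int64(1700000000)
	issued, still := exchange(secret, "zone-1", []string{"v1", "v2", "v3"}, 2, now, 600)
	fmt.Println(len(issued), still)
	fmt.Println(validateAccessToken(secret, issued["v1"], "zone-1", now+1))
	fmt.Println(validateAccessToken(secret, issued["v1"], "zone-1", now+601))
}
```

In the flow described above, `exchange` corresponds to one run of the exchanger function, and `validateAccessToken` to the check performed when the refreshed page presents the token.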
Behind the Scenes – Leaky Bucket Implementation
Let’s look behind the scenes and see how those components were made. In general, we are using a leaky bucket implementation. If you’re navigating seatgeek.com, you don’t see a queue all the time, but we’re protecting most events on SeatGeek. The protection is based on the leaky bucket implementation, where we have a bucket for each path, for each protected zone, and each bucket can have a different exit rate. When a request comes in, we check whether the bucket protecting that zone is full. If it’s full, the request is routed to the queue page. If it has room, we generate the access token in real time. That access token comes from the same mechanism that I showed before: with the access token you can get into the protected zone. So in real time we create an access token, associate it with the request, and the user is routed to the target. That’s an example of how it works.
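A leaky bucket per protected path can be sketched as below. This is a generic textbook version written for illustration, not the code that runs at SeatGeek: the `LeakyBucket` and `Router` types, the capacity and leak-rate fields, and the millisecond timestamps passed in by the caller are all assumptions of the example.

```go
package main

import "fmt"

// LeakyBucket admits requests while there is room and drains at a fixed rate.
type LeakyBucket struct {
	Capacity   float64 // how many admissions fit in the bucket
	LeakPerSec float64 // exit rate: admissions drained per second
	level      float64 // current fill level
	lastMs     int64   // timestamp of the last leak calculation
}

// Allow leaks the bucket for the elapsed time, then tries to add one request.
func (b *LeakyBucket) Allow(nowMs int64) bool {
	elapsed := float64(nowMs-b.lastMs) / 1000.0
	if elapsed > 0 {
		b.level -= elapsed * b.LeakPerSec
		if b.level < 0 {
			b.level = 0
		}
		b.lastMs = nowMs
	}
	if b.level+1 > b.Capacity {
		return false // bucket is full
	}
	b.level++
	return true
}

// Router keeps one bucket per protected path, each with its own exit rate.
type Router struct {
	buckets map[string]*LeakyBucket
}

// Route returns "target" when the request may proceed (this is where an access
// token would be issued in real time) and "queue" when the bucket is full.
func (r *Router) Route(path string, nowMs int64) string {
	b, ok := r.buckets[path]
	if !ok {
		return "target" // path is not protected
	}
	if b.Allow(nowMs) {
		return "target"
	}
	return "queue"
}

func main() {
	r := &Router{buckets: map[string]*LeakyBucket{
		"/event/123/tickets": &LeakyBucket{Capacity: 2, LeakPerSec: 1},
	}}
	for _, t := range []int64{0, 0, 0, 1000} {
		fmt.Println(t, r.Route("/event/123/tickets", t))
	}
}
```

With a capacity of 2 and a leak rate of 1 per second, a third simultaneous request is sent to the queue page, and one more is admitted a second later once the bucket has drained.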
Why Are We Using AWS Lambda?
Then, why are we using Lambda? Why do we have that infrastructure in Lambda? In general, at SeatGeek, we are not using Lambda for the product. We have another infrastructure running on Nomad, which also runs on AWS. The virtual waiting room runs on a completely different stack, in Lambda. Why Lambda? Because we’re trying to avoid cascading effects. For example, if we ran our virtual waiting room together with the products that we’re trying to protect, and this environment caught fire, then the failure would cascade to the very solution that is protecting that environment. That doesn’t make sense. So we run it alongside, outside of our product environment. AWS Lambda provides a simple way to launch that environment from scratch, it scales that environment based on the traffic, and it is on demand as well.
Why Are We Using DynamoDB?
Why DynamoDB? Why are we relying on DynamoDB? First of all, DynamoDB Streams: it’s easy to stream data from Dynamo. With a few clicks in the console, you can stream the change that was made to a single row into a stream that you can consume later. Compare that with MySQL, for example. To do that with MySQL, we could use a Kafka connector that reads the MySQL logs and fires the changes to a topic in Kafka. Then you can run into problems with the team that supports MySQL. It’s a big coordination. It’s possible, but in the end, we chose Dynamo because it’s simple to stream the data. Also, Dynamo provides a nice garbage collector. I think everybody is familiar with the garbage collector in Java, the idea that you can collect the data that you don’t need anymore. In our case, the data of the queue is important during the on-sale. After the on-sale, you don’t need that data anymore. In normal operation, the size of our database is zero because there is no queue formed.
DynamoDB Partition – Sharding
There are some tricks regarding DynamoDB. It’s not so easy to scale. There are some limitations regarding partitions, so designing your partition key matters. For example, the default limit is 3,000 read requests per second, or 1,000 write requests per second, on a specific partition. Over that, Dynamo starts to throttle. When we get throttled, we need some mechanism to handle it, because if not, the user is going to receive a failure. You could, for example, retry. But if you retry, you can make the problem worse. What we basically do is sharding. Sharding is a way to scale past the limits that Dynamo supports per partition. For example, with that simple code in Go, we are creating 10 different shards. We design our hash key, which is our partition key, and then we append the shard to the partition key. That way we increase the amount of traffic that Dynamo supports. For example, if we’re using 10 shards, you multiply the limits by 10: instead of 3,000 read requests per second you’re going to support 30,000, and instead of 1,000 write requests per second, 10,000. It’s not free. If you need to run a query, for example, you need to deal with 10 different partitions, 10 different shards. A full table scan means 10 shards that you need to go through to fetch the data. You can run that in parallel, and you can try different combinations to avoid throttles in DynamoDB.
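The Go code shown on the slide is not part of this transcript. As a stand-in, here is a minimal sketch of appending a shard suffix to a partition key. The key format (`zone#shard`), the choice of an FNV hash of the visitor ID to pick the shard, and the function names are assumptions made for the example; picking the shard at random would work equally well for spreading writes.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// numShards is the number of suffixes spread across DynamoDB partitions.
const numShards = 10

// shardedKey appends a shard suffix to the base partition key. The shard is
// derived from the visitor ID, so the same visitor always lands in the same shard.
func shardedKey(zoneID, visitorID string) string {
	h := fnv.New32a()
	h.Write([]byte(visitorID))
	return fmt.Sprintf("%s#%d", zoneID, h.Sum32()%numShards)
}

// allShardKeys lists every partition key that has to be queried to read the
// whole zone. This is the cost of sharding: one query per shard.
func allShardKeys(zoneID string) []string {
	keys := make([]string, numShards)
	for i := range keys {
		keys[i] = fmt.Sprintf("%s#%d", zoneID, i)
	}
	return keys
}

func main() {
	fmt.Println(shardedKey("zone-1", "visitor-42"))
	fmt.Println(allShardKeys("zone-1"))
	// With the per-partition limits quoted in the talk, 10 shards raise the
	// ceiling to 10 * 3,000 reads and 10 * 1,000 writes per second.
	fmt.Println(numShards*3000, numShards*1000)
}
```

Reading a whole zone then means issuing one query per key returned by `allShardKeys`, which can be run in parallel as the talk suggests.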
Sync Challenge: DynamoDB – Fastly Dictionary
Then, finally, let’s talk about the sync challenge. We have the edge dictionary in Fastly, and we have DynamoDB as our primary datastore. We have the advantage of the DynamoDB stream. Writing to a table in the database and then talking to another system is a common problem, and it is addressed by the transactional outbox pattern. For example, if you’d like to write to a users table in MySQL and then fire a message to a queue, there is no transaction boundary around that operation. You need to find a way to guarantee consistency between those operations. One way is the transactional outbox pattern: basically, we don’t call the two datastores at the same time. When a request to change the protected zone arrives, for example, it talks to DynamoDB, and the zone gets updated in a single operation. Then this change is streamed, and we have a Lambda function that consumes that change from the Dynamo table and talks to the Fastly edge dictionary. If the communication with the Fastly dictionary fails, we can retry until it succeeds. This way, we introduce a little bit of delay to propagate the change, but we guarantee consistency, because there is no distributed transaction anymore. There are no longer two components, two legs, in our operation. It’s just a single operation, and then we apply the change in sequence.
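The consumer side of this flow, applying streamed changes to the edge dictionary in order and retrying on failure, can be sketched as follows. Everything here is hypothetical: `EdgeDictionary` is an interface invented for the example and not Fastly's API, `flakyDict` is a fake used to simulate failures, and the sketch caps retries at `maxAttempts` instead of looping forever. Backoff between attempts is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Change is one streamed update to a protected zone.
type Change struct {
	Key   string
	Value string
}

// EdgeDictionary is a hypothetical abstraction over the edge key-value store.
type EdgeDictionary interface {
	Upsert(key, value string) error
}

// applyInOrder applies changes one by one, retrying each up to maxAttempts.
// It stops at the first change that cannot be applied, so later changes never
// overtake earlier ones. The caller can then redeliver the remaining batch.
func applyInOrder(changes []Change, dict EdgeDictionary, maxAttempts int) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	for _, c := range changes {
		var err error
		for attempt := 1; attempt <= maxAttempts; attempt++ {
			if err = dict.Upsert(c.Key, c.Value); err == nil {
				break
			}
		}
		if err != nil {
			return fmt.Errorf("giving up on %q after %d attempts: %w", c.Key, maxAttempts, err)
		}
	}
	return nil
}

// flakyDict is a fake dictionary that fails a fixed number of calls first.
type flakyDict struct {
	failures int
	calls    int
	data     map[string]string
}

func (d *flakyDict) Upsert(key, value string) error {
	d.calls++
	if d.failures > 0 {
		d.failures--
		return errors.New("edge API unavailable")
	}
	d.data[key] = value
	return nil
}

func main() {
	dict := &flakyDict{failures: 2, data: map[string]string{}}
	changes := []Change{
		{Key: "/event/123/tickets", Value: "blockade"},
		{Key: "/event/123/tickets", Value: "throttle"},
	}
	err := applyInOrder(changes, dict, 5)
	fmt.Println(err, dict.data, dict.calls)
}
```

Because the datastore write is the only operation in the request path, a failure here delays propagation to the edge but does not leave the two stores permanently out of step.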
Let’s see an example of the edge dictionary. The edge dictionary is created as a simple key-value store in Fastly, then there’s a simple table that is in memory. Fastly offers an API that you can interact with that dictionary, that you can add items and remove items. This is an example of the code in VCL. With the same code in Rust, you can achieve the same, then you can take advantage of the edge dictionary, and also you can take advantage of the modern stack that you can run in the CDN using that kind of code. If you’d like to know more details about it, we’ve made a blog post together with AWS regarding how it works and how we are using AWS to help us to run our virtual waiting room solution. We have the link here.
Virtual Waiting Room Observability
I’d like to talk about observability, which is an important part of our system: how we’re driving our decisions based on metrics, based on the information that we are collecting. We have different kinds of behaviors. We rely on Datadog to store our metrics, but we also use the AWS Timestream database, which provides us long-term storage. Ideally, we can monitor and observe everything: whether there are problems in the latency of Lambda functions, how Fastly is performing, how many protected zones we have in our system, whether there are errors, what the length of the queues is, and how long it takes for users to get notified. All the dashboards give the operators and the engineers a view of what’s going on during the on-sale. We are also trying to take advantage of that to provide sensors. Sometimes we have traffic that’s not expected, and then we can notify the users, for example through Slack, so they are aware that the traffic is not what was predicted.
Next Steps for Our Pipeline
Pellegrino: Ander talked about the current state, so the solution that powers 100% of all of our on-sales and all of our operations for a little bit over the past year. Let me talk to you a little bit about things that we’re looking to do in the future. We don’t claim that we have all the solutions yet. I would like to offer you a little bit of insight about how we’re actually thinking about some of these problems. The first thing is automation. Automation is a key important thing for us. Because as we grow, and as we have more on-sales coming, we start to see bottlenecks. Not a bottleneck, but having humans part of that process is not scalable for us. Our vision is to have on-sales being all done only by robots, meaning that a promoter can design their on-sales timeline. They can say, I’m going to have a marketing push happening at this time, and this is where the tickets can be bought, and all the rest is able to be done. Ander was talking about the exit rate, so we could adapt to the exit rate based on observing the traffic. We could also have different ways of alerting people based on, if that’s happening within the specific critical moment of an on-sale, that could have a different severity. Also, fraud detection is always an important thing for us. We want to get even more sophisticated about how we can detect when something is a legitimate behavior versus when something is actually an attempt of abuse.
Next Steps for Our Operations
I think that’s a very key point here: our systems must understand which mode they are operating in. That means each one of the services that we have, each one of the microservices, should be able to know, am I running in an on-sale mode? That can inform our incident response process. Let’s say I have an issue where people cannot get access to a specific event. If that’s happening outside of normal hours, it has a different incident priority than if it happens during an on-sale, so the telemetry should also know that. I would like to be able to have each one of our dashboards reporting not just what my p99 is for a specific endpoint, but what my p99 is only during normal operations, or only during on-sales.
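Reporting a p99 per mode amounts to tagging each latency sample with the mode it was recorded in and computing the percentile per tag. The sketch below shows the idea in plain Go; the `Sample` type, the mode labels, and the nearest-rank percentile method are choices made for the example, and in practice a metrics backend would do this aggregation from a tag on the metric.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is one latency measurement tagged with the operating mode.
type Sample struct {
	Mode      string // e.g. "normal" or "on-sale" (labels assumed for the example)
	LatencyMs float64
}

// percentile returns the p-th percentile (1..100) using the nearest-rank method.
func percentile(values []float64, p int) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := (p*len(sorted) + 99) / 100 // ceiling of p*n/100 in integer arithmetic
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// p99ByMode groups samples by mode and computes the p99 of each group.
func p99ByMode(samples []Sample) map[string]float64 {
	byMode := map[string][]float64{}
	for _, s := range samples {
		byMode[s.Mode] = append(byMode[s.Mode], s.LatencyMs)
	}
	out := make(map[string]float64, len(byMode))
	for mode, values := range byMode {
		out[mode] = percentile(values, 99)
	}
	return out
}

func main() {
	samples := []Sample{
		{Mode: "normal", LatencyMs: 20},
		{Mode: "normal", LatencyMs: 35},
		{Mode: "on-sale", LatencyMs: 120},
		{Mode: "on-sale", LatencyMs: 900},
	}
	fmt.Println(p99ByMode(samples))
}
```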
Another thing that is critical for us and we’re making some important movements in that direction is around a service configuration. We do use several vendors for some of the critical paths. We would like to be able to use and dynamically change, so like say maybe I have not only one payment processor, but in order to guarantee that my payments are coming through, I can use several during an on-sale. Maybe outside of an on-sale that is not as important. SLOs, for us, whenever we define our SLOs, we need to understand our error budget, but we need to also be able to classify, what’s my error budget during an on-sale? We already do quite a bit of that. We would like to do that even further.
Summary and Key Takeaways
We talked about how it’s important to think about elasticity in all layers of infrastructure. Queues are useful. People don’t like being in a queue, but queues are vital components. That doesn’t reduce the importance of actually designing elasticity in all the different layers. If you have a queuing system, maybe you should also think about how you’re going to scale your web layer, your backend layer, even the database layer as well. Another thing that was critical for us: really understand the toolkit that you have at your disposal. The whole solution works so well for us because we’re able to really tap into the best of the tooling that we had at our disposal. We worked with AWS closely on that one. I think we could also have built this solution using other toolkits, but it would look very different. I would highly encourage you to really understand the intricacies, and all the specific things about the system that you’re leveraging.
This is something that we started using, and it was a pleasant surprise. That’s a topic that we’re seeing more. I highly encourage, maybe you have a certain type of use case that fits into datastore, into moving some of that storage over to the edge. It’s a relatively recent topic. It works for us. I would encourage you to give it a try. Maybe that suits you. Maybe you have a high traffic website that you could leverage, like pushing some of that data closer to the edge and where the users are accessing from in order to speed up some processes. I’ll just encourage you to take a look at that.
Questions and Answers
Ignatowicz: One of the main topics is about my business logic or my infrastructure logic moving to the edge. How do I test that? How do I test my whole service when part of my code is running in a CDN provider such as Fastly? How do I do some integration test that makes sure that all my distributed system that is becoming even more distributed, we’re talking now a lot of microservices run in the same code, but pushing code for other companies and other providers, especially cloud code? How do I test that?
When you try to use third-party solutions, like vendors, to provide that kind of thing, for example, we have a contract with BlazeMeter. BlazeMeter offers that, but it’s quite expensive. We decided to build our own solution: we are using AWS Batch with play. It’s a simple one. We treat load tests like a simple long batch job that we need to run. Every Monday, we launch 1,000 browsers that run against our staging environment. Then when we wake up, we receive the report to see how it went, whether it ran successfully or not. For the rest of the week we have small executions only as a sanity check. The drawback is that we are running VCL, the old stack, so we don’t have the newer kind of test that can run on every pull request. Instead, we do that during the week, three times per week.
Ignatowicz: Do you dynamically determine when queues are necessary, or those have to be set up ahead of time? For example, what happens if someone blows up, and suddenly everyone wants tickets for a concert that was released last week?
Parra: That was the purpose of the waiting room solution internally at SeatGeek. We are a ticketing company, and selling tickets is part of our core business, so we need to deal with this high traffic. When you have an on-sale, for example an on-sale next week, that on-sale was planned a month ago. The target of the virtual waiting room was to protect all the events on SeatGeek. All the events on seatgeek.com are protected by the virtual waiting room, by Room. Then Vitor mentioned automation. In the beginning, we were creating the protected zones manually. It doesn’t scale. We have thousands of events happening on seatgeek.com, on the platform. Now we have an extension of this solution that basically takes all the events that are published through seatgeek.com and creates protected zones with exit rates for them. What’s the next step of this automation? We have the planning of all the on-sales: when the on-sale starts, when the first sale is. With all the timeline information for each event, we can decide when each transition is going to be applied. For example, if the on-sale starts at 11:00, we’re going to automatically blockade the path at 10:30, then at 11:00 transition to throttle, without any manual interaction. Everything is automatic.
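The timeline-driven transitions in this answer (blockade at 10:30, throttle at 11:00) reduce to a small function of the current time and the on-sale start. The sketch below is illustrative only: the `stateAt` function, the `clock` helper, and the `scheduled` label for the period before the blockade are hypothetical names, and the 30-minute lead is taken from the example in the answer.

```go
package main

import (
	"fmt"
	"time"
)

// stateAt returns the state a protected zone should be in at time now, given
// the on-sale start and how many minutes before the start the blockade begins.
func stateAt(now, onSaleStart time.Time, leadMinutes int) string {
	blockadeFrom := onSaleStart.Add(-time.Duration(leadMinutes) * time.Minute)
	switch {
	case !now.Before(onSaleStart):
		return "throttle" // on-sale has started: drain the queue
	case !now.Before(blockadeFrom):
		return "blockade" // shortly before the on-sale: hold everyone in the waiting room
	default:
		return "scheduled" // hypothetical label for "no transition applied yet"
	}
}

// clock is an example helper that builds a time on a fixed day in UTC.
func clock(hour, minute int) time.Time {
	return time.Date(2024, time.January, 1, hour, minute, 0, 0, time.UTC)
}

func main() {
	start := clock(11, 0)
	for _, now := range []time.Time{clock(10, 0), clock(10, 30), clock(10, 59), clock(11, 0)} {
		fmt.Println(now.Format("15:04"), stateAt(now, start, 30))
	}
}
```

A scheduler that evaluates this function periodically for each published event would apply the transitions without manual interaction.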
Ignatowicz: You do this blocking for all the events?
Parra: All the events.