eBay Using Fault Injection at the Application Level With Code Instrumentation

MMS Founder
MMS Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

eBay engineers have been using fault injection techniques to improve the reliability of their notification platform and explore its weaknesses. While fault injection is a common industry practice, eBay attempted a novel approach that leverages code instrumentation to bring fault injection into the application level.

The platform is responsible for pushing notifications to third-party applications to communicate the latest changes in item price, item stock status, payment status, and more. It is a highly distributed, large-scale system relying on many external dependencies, including a distributed store, message queues, push notification endpoints, and others.

Usually, says eBay engineer Wei Chen, fault injection is carried out at the infrastructure level, for example by causing a network failure to introduce an HTTP error such as a server disconnect or timeout, or by making a given resource temporarily unavailable. This approach is expensive and has a number of implications for the rest of the system, making it hard to explore the effect of faults in isolation.

But this is not the only possible approach, says Chen. Instead, faults can be created at the application level, e.g., by adding a specific latency within the HTTP client library to simulate a timeout.
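As a rough sketch of what such an application-level fault could look like, the snippet below wraps an HTTP call and injects an artificial delay before delegating to the real client; the class name and delay handling are illustrative assumptions, not eBay's actual code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical wrapper that simulates a slow dependency by sleeping
// before delegating the request to the real HTTP client.
public class LatencyInjectingClient {

    private final HttpClient delegate = HttpClient.newHttpClient();
    private final long injectedDelayMillis;

    public LatencyInjectingClient(long injectedDelayMillis) {
        this.injectedDelayMillis = injectedDelayMillis;
    }

    public HttpResponse<String> get(String url) throws Exception {
        // Injected fault: artificial latency, e.g. to trigger caller-side timeouts
        Thread.sleep(injectedDelayMillis);
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return delegate.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```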

We instrumented the class files of the client libraries for the dependent services to introduce different kinds of faults we defined. The introduced faults are raised when our service communicates with the underlying resource through the instrumented API. The faults do not really happen in our dependent services, owing to the changed codes, but the effect is simulated, enabling us to experiment without risk.

eBay has implemented three basic kinds of instrumentation to force invoked methods to exhibit faulty behavior: blocking or interrupting the method logic, for example by throwing an exception; changing the state of a method, for example altering the return value of response.getStatusCode(); and replacing the value of a method parameter, that is, modifying the value of an argument passed to the method.
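The following sketch illustrates the combined effect of those three fault types on an instrumented dependency call; the class, the FaultType enum, and the hard-coded configuration are hypothetical and only meant to convey the idea.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative only: how a dependency call might behave once the three
// fault types described above have been woven into it.
public class InstrumentedDependencyClient {

    enum FaultType { THROW_EXCEPTION, ALTER_RETURN, REPLACE_PARAMETER }

    // In a real setup this would come from a configuration management system.
    private final Set<FaultType> enabledFaults = EnumSet.of(FaultType.ALTER_RETURN);

    public int callDependency(String endpoint, String payload) {
        if (enabledFaults.contains(FaultType.THROW_EXCEPTION)) {
            // 1. Block or interrupt the method logic
            throw new RuntimeException("Injected fault: simulated dependency failure");
        }
        if (enabledFaults.contains(FaultType.REPLACE_PARAMETER)) {
            // 3. Replace the value of a method parameter before it is used
            payload = "corrupted-payload";
        }
        int statusCode = realCall(endpoint, payload);
        if (enabledFaults.contains(FaultType.ALTER_RETURN)) {
            // 2. Change the state of the method, e.g. the returned status code
            statusCode = 500;
        }
        return statusCode;
    }

    private int realCall(String endpoint, String payload) {
        // Placeholder for the actual remote invocation
        return 200;
    }
}
```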

To implement the above three types of instrumentation, we have created a Java agent. In the agent, we have implemented a classloader which will instrument the code of the methods leveraged in the application code. We also created an annotation to indicate which method will be instrumented and put the instrumentation logic in the methods annotated.
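A minimal skeleton of such an agent might look like the sketch below; the @InjectFault annotation and the agent class name are hypothetical, and the actual bytecode weaving (typically done with a library such as ASM or Javassist) is left as a comment.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Hypothetical marker annotation: methods carrying it are candidates
// for fault-injection instrumentation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface InjectFault {
    String faultType();
}

public class FaultInjectionAgent {

    // Entry point invoked by the JVM before main() when the agent is attached.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // Inspect classfileBuffer for methods annotated with @InjectFault
                // and weave the fault-raising logic into them; returning null
                // leaves the class unchanged.
                return null;
            }
        });
    }
}
```

Such an agent would typically be attached with the -javaagent JVM flag and a Premain-Class entry in the agent jar's manifest.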

In addition, eBay engineers implemented a configuration management system to dynamically change how fault injection behaves at runtime. In particular, for each endpoint supported by the eBay app, engineers can alter a number of parameters to test specific behaviors.

According to Chen, eBay is the first organization in the industry to practice fault injection at the application level using code instrumentation. If you are interested in this approach, do not miss the full explanation provided in the original article.



Learnings from Spotify Mobile Engineering’s Recent Platform Migration

MMS Founder
MMS Aditya Kulkarni

Article originally posted on InfoQ. Visit InfoQ

The Spotify Mobile Engineering team recently elaborated on their experience with a platform migration. Working on an initiative under the Mobile Engineering Strategy program, the team migrated their Android and iOS codebases to build with Bazel, Google’s open-source build system.

Mariana Ardoino and Raul Herbster from the Spotify Mobile Engineering team reflected on the learnings from the migration in a blog post. The migration effort impacted more than 100 squads across Spotify. Acknowledging that migrations of varying size and complexity are going to be “the norm” in the future, the team set the context by highlighting the need to define the scope of a migration.

Often, when the extent of a migration is unknown, it makes sense to focus on value and understand the goals of the migration. The team recommends starting small with proofs of concept (POCs) and validating them with stakeholders, as opposed to identifying all possible scenarios at the start. It is also useful to understand what stakeholders need from the migration by collaborating with them in these early phases.

When a large number of squads is impacted and progress is slow, large infrastructure and architecture changes may seem impossible. Such scenarios call for a greater level of stakeholder engagement. Staying in contact with stakeholders via Slack or email groups, and sharing progress through newsletters and workplace posts, may re-highlight the importance of the migration. Looking for automation opportunities may also help, and reserving time for research spikes is another good option to try, which can include swarming with teams to work on the migration.

As an aside, emphasizing the collaboration aspect in the context of Agile/DevOps transformations, Nigel Kersten, CTO at Puppet, said,

Fundamentally the problem is that all of these transformations have a massive people-interaction component, and the bigger and older you are as an organization, the more difficult it is to change how people interact, and the higher up the chain you have to go to create organizational change.

The Spotify Mobile Engineering team mentioned that competing priorities are a “fact of life” for any platform team involved in migrations. Whether a migration involves adopting new technology or reducing tech debt, the team's motivation may suffer due to slow progress on the migration. The team recommends evaluating the progress of the migration continuously, motivating the team by showing the positive impact of the migration, and tweaking approaches to achieve specific migration goals.

Finally, discussing accountability, the Spotify Mobile Engineering team advises not to expect internal or external alignment on driving change to emerge on its own over time. Using dashboards, maintaining a migration timeline, and using data or trend graphs may help visualize progress and highlight required adjustments.



Microsoft’s New Memory Optimized Ebsv5 VM Sizes in Preview Offer More Performance

MMS Founder
MMS Steef-Jan Wiggers

Article originally posted on InfoQ. Visit InfoQ

Microsoft recently announced two additional memory-optimized virtual machine (VM) sizes, E96bsv5 and E112ibsv5, in the Ebsv5 VM family. Developed with the NVMe protocol, they provide performance of up to 260,000 IOPS and 8,000 MBps of remote disk storage throughput.

Earlier, the company made Ebsv5 and Ebdsv5 generally available, offering up to 120,000 IOPS and 4,000 MBps of remote disk storage throughput. With the addition of the E96bsv5 and E112ibsv5 VM sizes in public preview, the company aims to allow customers to consolidate existing workloads into fewer or smaller VM sizes and achieve potential cost savings.

Ebsv5-series virtual machines run on the 3rd Generation Intel Xeon Platinum 8370C (Ice Lake) processor in a hyper-threaded configuration and can reach an all-core turbo clock speed of up to 3.5 GHz. In addition, the new VM sizes in the series only support premium SSD storage and can have up to 672 GiB of RAM and sizeable local SSD storage (up to 3,800 GiB).

With the new VM sizes, Microsoft tries to stay ahead of its competitors, AWS and Google. AWS offers memory-optimized compute instances, including the latest Amazon EC2 R7iz instances (powered by 4th Generation Intel Xeon Scalable processors). Similarly, Google offers memory-optimized machines with the M machine series.

Priya Shan, a senior program manager, explains the benefit of the new sizes in a Tech Community blog post:

While the Ev5 VMs meet the performance requirements for many business-critical applications, some large on-premises database environments require even higher VM-to-disk throughput and IOPS performance per core which the latest NVMe Ebsv5 VM sizes can now support. The NVMe-based Ebsv5 VMs offer customers the performance to scale without rearchitecting their applications while reducing the cost of infrastructure and licensed commercial software running on those instances.

In addition, Michael Roth, an Azure HPC Specialist at Microsoft, tweeted:

They use NVMe and provide exceptional remote storage performance offering up to 260K IOPS and 8K MBps throughput, great for #HPC use cases.

The E96bsv5 and E112ibsv5 VMs are currently available in the US West Central region, and more regions will follow. Additionally, customers can sign up for the preview to access the new sizes. Lastly, the pricing details are available on the pricing page.



Kubernetes 1.26 Released with Image Registry Changes, Enhanced Resource Allocation, and Metrics

MMS Founder
MMS Mostafa Radwan

Article originally posted on InfoQ. Visit InfoQ

The Cloud Native Computing Foundation (CNCF) released Kubernetes 1.26, named Electrifying. The release has new features such as image registry changes, dynamic resource allocation, and unhealthy pod eviction enhancements.

Also, there are beta features included in the release, such as non-graceful node shutdown, retroactive default storage class assignment, and an improved kubectl events subcommand.

Several features have been marked generally available or stable, such as support for mixed protocols, reserved service IP ranges, and Windows privileged containers. In version 1.26, the CRI v1alpha2 API is deprecated and legacy authentication for Azure and GCP is removed.

In the new release, container images for Kubernetes are published exclusively to registry.k8s.io, the new container image registry endpoint introduced in the previous release. The change reduces the dependency on a single entity, allowing the load to be spread between Google and Amazon and opening the door to other cloud providers in the future.

Dynamic resource allocation has been introduced to provide better resource management for advanced hardware such as GPUs and FPGAs. It enables the scheduler to take not only CPU, memory, and storage into account, but also to control access to such specialized hardware.

Also, a new feature gate, PDBUnhealthyPodEvictionPolicy, has been added to define the criteria for when unhealthy pods should be marked for eviction when using a PodDisruptionBudget, providing a way to limit the disruption to running applications when pods need to be rescheduled.

In version 1.26, Service Level Indicator (SLI) metrics are exposed for each Kubernetes component in Prometheus format to monitor and measure the availability of Kubernetes internals.

Non-graceful node shutdown moved to beta in version 1.26 and is turned on by default. Previously, when a node was shut down or crashed without being detected by the kubelet, a pod that was part of a StatefulSet would be stuck in a terminating status forever until manually deleted. With this feature turned on, such pods are forcibly deleted and new pods are created on a different node.

Retroactive default storage class assignment was first introduced in version 1.25 and is now in beta. The feature covers the scenario in which cluster administrators change the default storage class: if it is turned on, all PVCs with an empty or missing storageClassName attribute will automatically use the new default storage class.

Another feature that was first introduced in version 1.25 and is now in beta is taking both node affinity and taints into account when configuring a topology spread constraint. This enhances the spread of workloads across cluster nodes to increase high availability as well as resource utilization.

Enhancements added to the kubectl events subcommand in v1.23 graduated to beta in this release. The purpose of those changes is to support all the functionality of the kubectl get events command and address issues related to sorting.

Support for Windows privileged containers became generally available in this release and is enabled by default. This feature allows Windows containers to access the underlying host for system administration, security, and monitoring or logging workloads.

The CPU manager, which enables better placement of workloads in the kubelet, also graduated to generally available in version 1.26 and is turned on by default. This is useful for workloads that are CPU-intensive or sensitive to CPU throttling.

Kubernetes is an open source production-grade software system for deploying and managing application containers at scale.

According to the release notes, Kubernetes version 1.26 has 37 enhancements including 16 new, 11 becoming generally available or stable, and 10 moving to beta. In addition, 12 features are being deprecated or removed.

CNCF will host a webinar on January 17, 2023, to discuss the updates from the release team and answer questions from the community.



Podcast: Kanplexity as an Approach to Tackle Complex Problems

MMS Founder
MMS John Coleman

Article originally posted on InfoQ. Visit InfoQ

Transcript

Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture podcast. Today I’m sitting down across many, many miles and 12 time zones with John Coleman.

John, welcome. Thanks for taking the time to talk to us today.

John Coleman: Shane, thank you so much for having me on the show.

Shane Hastie: You and I have corresponded and know each other relatively well. But I suspect that many of our audience will not have heard of you before. So probably a good starting point is who’s John?

Introductions [01:02]

John Coleman: Yes, probably not a bad starting point. If you Google my name, you’ll find conductors of orchestras and all sorts of much more interesting people than me. My name is John Coleman and I call myself an agility chef, the metaphor being when you go into an organization and you have all sorts of fancy ideas of what you want to do and then you open the fridge and then you see, oops, you don’t have that many ingredients. So I try to figure out what dish we can cook together, and the key thing is the co-creating. But I’m also a trainer from a sense. I am a trainer with scrum.org and proKanban.org, and I was involved with Daniel Vacanti in creating Kanban Guide back in 2020 and got a couple of podcasts myself, the Xagility podcast for executives and Agility Island for our practitioners. And excited to be here today, thank you.

Shane Hastie: Thank you so much. So thinking of our audience who are technical folks, what are the challenges that you see them struggling with when you open the fridge in the organization?

Scrum is a good framework and does not fit in every context [02:00]

John Coleman: So the options that are available to engineers, for example, when they open the fridge and they’re being given the usual options and Scrum is really good. I like it. I’m a scrum trainer myself. But sometimes the personal values of the people doing the work don’t tie in with the Scrum values. The Scrum values are very nice. The people might be really, really nice, but they might just have a slightly different belief system and it can feel sometimes like Scrum can be a form of imposition if everyone’s supposed to be committing towards the Scrum values. And maybe there’s other little things as well that they struggle with.

Some people will say, can’t you always deliver something in 30 days? But maybe some teams struggle with that. So are there other options out there available for people who struggle with some of the rigidity of Scrum? But at the same time, I do understand why it’s there is to help people to deal with complexity. Not withstanding all that, it can still be quite difficult. And what might happen is as soon as I turn my back, the team is doing something completely different to whatever the option is whether it’s Scrum or whatever the option is, design thinking or whatever it is.

Shane Hastie: How do we help them make good decisions?

John Coleman: The key thing for me is helping the team to understand what the options are. And this is something a lot of change agents need to be careful with as well. So sometimes we like to present options that are within our own fridge. We have our sweet spot and there’s certain things that we’re good at helping people with, but maybe those options don’t suit, actually. I discovered Shape Up for example in the last few weeks, which is about refining for six weeks and then delivering for six weeks and not having a Kanban board or any of these kind of things. A different approach, not so agile and so on in the way they declare it, but agile in different ways. And even if that’s something that’s not my sweet spot, being open to learning more about that. That’s just an example. But whether options that are not within my catalog, if you like, are more suitable to a team. So I think it’s important for change agents to be really open to what’s suitable for a team and not just what I’m good at doing myself kind of thing.

Shane Hastie: One of the things that you are known for is talking around complexity. What does complexity mean to a technologist?

What does complexity mean in a technical context? [04:14]

John Coleman: Eventually, if I was to really boil it down, maybe there’s some things that we haven’t figured out. And even if we get together with our colleagues, we still wouldn’t get it figured out first time. We might have to spin our wheels a little bit… Maybe spin our wheels is the wrong metaphor, but maybe go into some kind of empirical loop, try to learn something, try something, maybe use experiments to settle debates and things like that or twist old ideas from the past to deal with the problem that’s in front of us. Sometimes our expertise can be limited to deal with the problem that’s right in front of us. Maybe we even need help from someone who is in a completely different field of work. Maybe we need fresh thinking. And so when engineers feel like they’re going into analysis paralysis or they feel it’s a group think going on, it could be a bit of an indication that maybe we need to be more evidence-based and we need to be more open to how we’re going to solve the problems that are in front of us.

Shane Hastie: So recognizing that I need to be more open as an engineer, how do I step into that space because I’m actually trained to make decisions.

John Coleman: Indeed. And essentially humility gets handed to us, doesn’t it, when we do experiments. And first of all, we need to be open to running some of these experiments or probes to try and figure out what the right option might be. And I would say 70 to 80% of the time we haven’t figured out something, what we thought might solve the problem isn’t actually what will solve the problem. It might even be the wrong problem. So often are we even looking at the right problem? And within that problem, are we looking at the right people who might have that problem and do we have the right views on what those people are looking for? I remember I had some people coming to me, they were doing an executive MBA and they were presenting to me their projects for the quarter. And I remember this group in particular and they had all these slides on this persona called Caterina, and Caterina did this and she did that and this is what she wanted, this is what she needed.

And I asked them, I said, “This is all really nice information. Did you try to recruit somebody like Caterina? Did you try to talk to someone that was like Caterina to see if that’s actually what her problem was and that’s what she needed?” And they didn’t. And even simple little steps like if you got UX’ers working with you or people doing discovery or research to try and understand what the customer wants, even going along just to take notes could really open your eyes to the way your customer or end user or consumer is actually using what you’ve built. They might be using it a very different way to what you intended. It might be much better way than you intended. You might never have realized that they would do that with what you built or what you might do next. So I like the expression if people use Scrum for example, there’s a thing in Scrum called a sprint review and it’s kind of like going to an event to see what the customer asks for, but she doesn’t want. It’s very difficult to figure out what the customer wants, isn’t it?

Shane Hastie: Yes, the I don’t know what I want but I’ll know it when I see it.

John Coleman: Yes, exactly.

Shane Hastie: Which is a perfectly legitimate approach to eliciting requirements and yet so often we don’t involve the customer. But wasn’t this what all of the agile methods and these ideas were supposed to help us overcome?

There are approaches for learning and understanding customer needs [07:39]

John Coleman: Yes, indeed. And some of them do a reasonable job of helping us to deal with stuff that we haven’t figured out. Design thinking, for example, lean startup, lean UX, really good at helping you to dig deep and learn humility, even if we don’t have humility at the start, it’s like getting into a boxing ring and you get your backside handed to you because you’re not able for the opponent that’s in front of you. And what we learn when we do experiments with those approaches really helps us.

Often we also need to be able to move some of our work that we haven’t really figured out into the work that we think we’ve figured out. And being able to deal with that transition, it’s important. And so having some options that allow you to have some stuff that you haven’t figured out yet, as well as stuff that you have figured out as well as some stuff that you thought you’d figured out but now you realize you’re in the wrong place, having something that’s flexible enough to help you to deal with those kind of categories of work that’s in front of you is important.

Treating everything as though we haven’t figured it out or treating everything as though we have figured it out can lead to bad results. For example, people do standups, or they do daily scrums, depending on what kind of approach they’re using. And often they talk about what they did yesterday, what they’re doing today, even though they’re not supposed to do that kind of stuff. That was only ever a suggestion. Maybe what I did yesterday was so obvious and really I don’t need to go into the details of all of that. Maybe I’ve missed a trick. Maybe we should have talked about, well what do we need to work on together today? How can we help each other out? Is there any work that needs a bit of love and attention that we’ve forgotten about, or maybe something’s got stuck and we need to get behind it and see if we can get it sorted.

I think teams need options that help them to deal with all these different types of complexities that are in front of them. And some of the options that are handed to us aren’t so good at dealing with deep complexity. And some of them are very good at dealing with deep complexity, but that’s what they deal with. And then once you’ve figured stuff out, you kind of need some other option to help you. And I think that’s tricky for teams to be changing approach all the time in terms of depending on what they’ve figured out and what they have.

Shane Hastie: Having to change approach though is a way that we do then adapt to make sure when we shift from that uncertainty to at least a little bit of certainty. How do I sense when this needs to happen?

Understand complexity and read the signals to know when to shift modes [10:00]

John Coleman: I believe that we need at least somebody around the team or in the team to be educated and skilled on complexity and understanding what signals there might be and helping to essentially guide the team. I mean, a lot of teams are self-managing and so on. And you don’t even have to have a team, you can have a crew. For example, airline pilots when they sit in the cabin together, they might never have seen each other before, but they’re so well trained, there’s trust already in their training and so they can team quite quickly. So crews, you can have a crew for example. But teams and crews often need some kind of guidance. Can we expect everybody to be an expert on complexity and really understand the compass that you need to read? It’s not a straightforward compass of a north, south, east and west. It’s a kind of almost like three-dimensional compass.

Does somebody understand how to read the compass that’s in front of us and at least inform the team or crew about what might be happening. And then that might give some hints in terms of what might be appropriate response types to deal with this situation. So there was a great example at the end of 2021 in the UK where Boris Johnson was trying to figure out what to do when Omicron landed on the UK shores. And he was looking at the scientists and the people who do the numbers and all that, all his medical experts and so on. And they didn’t have an answer in terms of what would happen. And then he was looking at the people maybe doing the experiments in the labs and trying to see well will the vaccines and the booster that was available at the time will they work with Omicron, and they didn’t know they would need six weeks to figure it out.

And he made the decision actually to be a little bit paradoxical, what we call aporetic and kind of asking, are we actually doing the right thing here? And I believe, don’t know, this is my personal opinion, I don’t know if this is sure or not, but I believe that he went aporetic and he said, “Okay, I need to buy some time here.” And he brought us into chaos in the UK and got millions of people to get the booster very, very quickly, more quickly than any program done before. And whether it worked or not, history will tell us, but it seemed to help the UK to be the first economy to kind of come out of COVID. And that was a great example of someone who understood that the experts don’t have the answers. The people who are even experimenting will take a while, I need to do something here.

And a lot of people think, well you shouldn’t be in chaos. Sometimes you need to go into chaos, sometimes you need to release the shackles and just do something. But as long as you’re just not doing something that’s going to make it worse, that’s the trick. So having all of these skills, can we really expect everyone to have those? I mean, I’d love if that was the case. I would just expect that there would be at least somebody either in the team or crew or around it who can help the team or crew to understand what they might be facing and help them to realize that actually if they did spend more hours than that, that’s not necessarily the problem. Or getting more engineers to help them, that’s not necessarily going to help them. That might need to do something a bit more dramatic.

Shane Hastie: Building on that, you have a new thing in the world, Kanplexity. Tell us about Kanplexity.

Introducing Kanplexity [13:10]

John Coleman: Thank you, Shane. So I was working with a team in an oil company in early 2019 and they asked for some help with Scrum and they were really, really good people. Each of them had a PhD of some sort, different types of PhD as well, double PhDs. I believe one of the ladies was publishing articles in journals and so on, really, really smart team. And they were working together on a really, really complex problem. And I had to try and figure out something for them. At the time, there’s something in Kanban we call cycle time, you might call it flow time or lead time. We won’t get into debates about what you should call it. But essentially, how long does it take you to get a feedback loop from the situation in terms of whether we’re doing the right thing. And I checked it out, their cycle time was roughly four to six months for each item.

I felt that a lot of the options on the table weren’t really going to suit them. And so I met the team and I showed them a number of options. But one of the things I did as well was one of the things I learned over the years was to learn the business domain and technical domain of the people that I’m working with. And so even though I was on Wikipedia and doing my own research, I was actually asking these people to teach me about rock pressure and drilling under the ground and all these kind of things. And I’d never know as much as these people, but at least I had some idea what they were trying to deal with. And so that kind of reduced maybe what I would’ve thought of would’ve been maybe 10 options to maybe two or three. And I thought Kanban might be an option for them. That at least could we try to optimize our flow so that at the moment their work takes four to six months, but if we optimize the flow, can we get that down to shorter than a month even?

But because their work is deeply complex, I was troubled by this and it wasn’t just that it was deeply complex, they also had some constraints as well. You hear Steve Tendon talking about TameFlow for example, and he’s got a theory of constraints background. And so I did a lot of research understanding Steve’s work in terms of theory constraints and others. And so I had this problem, we had deep complexity, we had a real bottleneck, a physical bottleneck, a few physical bottlenecks to deal with, and what can I do to help these people? And so I had a feeling that maybe they need to do some kind of review with their stakeholders every five or six weeks, maybe. And I was flexible because I noticed it was really difficult to get these… these are really important stakeholders like senior vice presidents where that means (really) important persons meeting together (although they’d be too humble to admit that).

And it was really cool actually Shane, because I learned a trick as well from the team that I was working with. I would never before have asked the team to send out a pre-read before our review because I found that they wouldn’t read it, but they did send it out and they didn’t put any major ceremony into preparing the pre-read because that itself would’ve interfered with doing the work. But maybe a couple of days they spent putting together a pack and sending that out. And the execs that came in, they actually had read the pack. And then what we had is what Michael Küsters would call a Steve Jobs style review where we were talking about the work, we were talking about what we’re going to do next. We were deciding on strategy and tactics in the review. And so that was really cool.

And then we started doing retrospectives and they were meeting every day, pretty much every day. They didn’t have planning so much, they were pulling in work as needed and so on. But they did have five to six-week periods and I was just flexible about when things happened and used really nice techniques as well in terms of helping them with retrospectives. And applied a trick as well, which I do now regardless of approaches where instead of doing really fancy retrospectives, and I did use some really fancy techniques and so on, but I realized that it was more important that the team actually got better. And so what I said to the team when they did have a retro, I said, look, let’s say you’ve got an hour and a half or a retrospective or something like that or maybe two hours or something like that.

Build time into the retrospective to fix things immediately [17:00]

We all know what the problems are in 20 minutes, don’t we Shane? So we said okay… I could have said, okay, let’s wrap it up now and we can put those items onto our list and we’ll look at those a later date or even next week or whatever. But I said, “We’ve got an hour, 10 minutes, how about we fix something right now?”

They said, “What? Huh?”

“Yes, we have an hour, 10 minutes. Do you think the lady at reception’s going to change because you want her to change?” I was just being a bit dramatic. Basically, in other words, change what’s in your control and let me deal with the stuff that’s beyond your control and I’ll work with the organization to get that stuff fixed. And we were really thorough about that. And so in the retrospectives then one of the first things we were doing is we were looking back on the things that we’d improved and they noticed, oh yes, we did this and we did that. Only one little thing every cycle if you like and to get better. So yes, Kanplexity was born so I codified it at the time, Shane, and I wrote a document on it and put it together. And so that’s how it started.

Shane Hastie: And what would the benefit be for a team to think about Kanplexity as an approach?

When is Kanplexity an appropriate approach? [18:00]

John Coleman: If you find that you need another option, if your work takes longer than 30 days, there’s lots of techniques to help you to get work down to shorter than 30 days. You can use example mapping, for example, from Matt Wynne, where you can say there’s 13 examples for a particular item. And you can say to a team, well, can we just do one example? And they say no, we need to do the other 12. Yes, maybe you do, maybe you don’t. But can you just do one example and then let’s learn. And then you might find out there’s four new examples the next time you do a review and you might find out seven of the examples you thought you had to do, you don’t have to do at all. And so there are lots of techniques to help people to get work done in less than 30 days.

But let’s say you’ve tried all of your tricks and all that kind of stuff and you still find that they need lot more than 30 days. Having something that is flexible is important I think without losing our soul at the same time. We still want to be sincere about trying to achieve some kind of agility, but maybe having something that’s a bit more flexible that maybe you don’t need to have specific roles. Maybe all you need is a team or a crew and maybe someone who’s willing to be the guide, could be someone in the team, could be an executive, could be a change agent, and them working together to look at the compass and orient and decide what to do next. That’s a major upgrade in the 2022 version by the way, that I basically refactored it based on Kanban Guide, which was released in 2020 so I don’t have to redescribe Kanban anymore and I could just focus on the complexity side.

And in the orientation guide it indicates that if you feel you’re… It even describes which space you might be in and gives some examples and so on. But then it depends on the category, which we’ve tried to codify, because it is based on Cynefin, the decision-making framework from Dave Snowden and Cognitive Edge, the Cynefin company. And we tried to make it a bit more accessible for people so that okay, I’m in this area so I need to facilitate for this, I need to optimize for that, this is what I should be looking for and maybe these are some things I can do. These are some response types. So that’s been a big addition to the guide.

And the other thing that’s changed in the last three years as well is, it’s much more informed by the different types of agility out there, like the Vanguard Method from John Seddon, and product management and product leadership, like deep listening, even from Indi Young. Apparently sometimes we don’t need to have the script when we interview the customer, sometimes we should just listen to what they have to say and just really, really listen and not be derailed by our own goals.

And lots of influences. Theory of constraints and there’s elements of Agile and Lean, of course. And even intent-based leadership from David Marquet. It’s there’s a big influence there. Particularly David Marquet’s most recent book as far as I’m aware, Leadership is Language where he talks about the red work and the blue work. I believe he was on this show. And I really like that in David’s work and it’s a really simple message. Are you in execution bias? I had 200 students at a university in London in January this year, 250 of them using Lean UX and having humility handed to them by the experiments that we’re doing. 70% of them were in execution bias, they were persevering when the evidence was overwhelming that they should either stop or pivot. In fairness, some teams collapsed into other teams and I actually congratulated them for that because that takes humility, doesn’t it? To say, okay, this idea isn’t going to work.

So yes, lots of these influences are in there now and really tested it with lots of teams since those early days when I was at the oil company, worked in other sectors as well. I will confess though, Shane, that even though I have a software development background myself, Kanplexity is optimized for people in non-software. You could use it in software equally. But I was thinking, do you know what? There’s probably a need also for some options for people outside of software because the way engineers are using agility in whatever way they are. But a lot of the time there’s frustration that the rest of the organization isn’t also practicing agility and it is spreading across the organization. And often I find that some of the options available for those departments, for example, might not be so suitable. And just trying to provide something that says, for example, if you haven’t figured stuff out, maybe you do need to have a review.

But if you have figured stuff out, maybe you don’t. Giving you some hints about when you might need to have a standup and when you might not. If everything is clear and you know what you need to do, you probably don’t even need to get together. And also there’s an influence from flight levels as well because in flight levels they’ve got interactions, they don’t have events and some of those interactions don’t have to be synchronous. I’ve been using a tool called Loom, for example, to send video messages. I sent one only a few hours ago to someone and it’s very handy across time zones and particularly if I’m working part-time for a client or something like that, send a video and they can like or react or comment at two minutes, 37 seconds and reply, “No, I didn’t like this, John.” They can even reply to the video with a video. So being open to people not having to actually get together for some of these interactions as well. We’ve learned a lot, haven’t we, in the last few years. So it’s kind of adapted to all of that as well, I think.

Shane Hastie: Some really interesting ideas here and things that I’m sure some of our audience are going to want to explore further. To do that, John, how do they get hold of you, and where do they find it?

John Coleman: Thank you, Shane. So the easiest place to find a Kanplexity will be Kanbanguides.org, that’s plural, Kanban Guides as in G-U-I-D-E-S.org. Also, orderlydisruption.com, which is my own company or x-agility.com. It’ll be on any of those three websites. By the time this goes out, it’ll already be up there on those sites. And yes, you could find me there as well. There’s Contact Us forums at those places. And you can also check me out on the Xagility podcast or the Agility Island podcast.

Shane Hastie: John, thanks so much for talking to us today.

John Coleman: Shane, thank you so much for having me. It was a real pleasure. Thank you.



Presentation: Examining the Past to Try to Predict a Future for Building Distributed Applications

MMS Founder
MMS Mark Little

Article originally posted on InfoQ. Visit InfoQ

Transcript

Little: I’m Mark Little, VP of engineering at Red Hat. I’ve been at Red Hat since 2006, when they acquired JBoss. I’m here to present to you on whether we can examine the past to try and predict the future for building distributed applications. I hear from some of my organization, and customers, various concerns around how building applications these days seems to be getting more complex. They need to be experts in so much more than just building the application. That’s not actually wrong. I want to look back and show that’s always been the case to one degree or another, and why. As a side effect, I also hope to show that application developers have been helping to drive the evolution of distributed systems for years, even if they don’t always know it.

This is not in Java, or an enterprise Java story. I’ll use Java and Java EE or Jakarta EE if you’re up to date, as examples, since many people will know about them, and if you don’t just know about them, you may have also even used them directly. This is not meant to be, A, a picture of me, and B, me complaining in this talk about complexity and about the impact on application developers. I’m trying to be very objective on what I’ll present to you. Hopefully, there are some lessons that we can all learn.

Where Are We Today?

Where are we today? Hopefully, there’s nothing in this slide that’s that contentious. Hybrid cloud is a reality. Developers these days are getting more used to building applications that are using one or more public clouds and maybe even private clouds, and in combination. Event driven architectures are used more often. Probably Node.js is the framework that really started to popularize this a lot. Then when cloud came along with serverless approaches like Amazon Lambda, or Knative, for instance, it became accessible to developers who didn’t want to use JavaScript, for instance. Then there are other popular approaches, I put one here, Eclipse Vert.x. It’s in the Eclipse Foundation. It’s a polyglot approach, even though it’s based on the JVM, similar in some ways to Node.js. Linux containers have come along. I think they’ve simplified things. Kubernetes simplified or complicated things. Cloud services, again, developers today are far more comfortable than maybe they were even five years ago with building applications that consume more of the cloud services that are out there, like Amazon S3, for instance.

IoT and edge are very important now, as well as cloud. Lots of people working in that space, connecting the two together. I think open source is dominating in these new environments, which is a good thing. Developers are really the new kingmakers. You don’t have to go too far to probably have heard that before. Is this really a developer’s paradise? It sounds like there’s lots of choice, lots of things going on. There are complexities, which cloud do you choose? Which Kubernetes do you choose? What about storage types and security implementations and distributed transactions? Which one of those do you take? Which model? It does seem like building apps today, the developer needs to know far more than they did in the past. In fact, if you’ve been in the industry long enough, it’s like it was in the ’80s with CORBA, as we’ll see. Developers are having to know more about the non-functional aspects of their applications and the environment which they run, but why?

What Application Developers Typically Want

I’m going to make some sweeping statements here, just to try and set some context. What do application developers typically want? I think, typically, they do want to focus on the business needs. They want to focus on the functional aspects of the application or the service or the unit of work that they’re trying to provide. What is it that they have to develop to provide a solution to the problem and answer to the question? Yes, all good application developers want to test things. That’s an interesting thing for them, something they need to understand. They’ll be thinking about their CI/CD pipelines, probably wanting to use their favorite language and probably wanting to do all of this in their framework, and favorite IDE. What do they not want? Again, sweeping statement. I think generally, they don’t want to become experts in non-functional aspects of building or deploying their application. For example, they probably don’t want to become clustering gurus. They don’t want to become an SQL king or queen. They don’t want to become transaction experts. Although speaking as a transaction expert, I do think that’s a great career to get into. They probably don’t want to be hardware sages, either. They don’t necessarily want to know the subtleties of ensuring that their application runs better on this architecture versus that, particularly if one has got GPUs involved, or one does better hyper-threading. These are things that really they would like to have abstracted away.

The Complexity Cycle

I’m trying to show you what I’m calling the complexity cycle. Hopefully, as I explain it, you get the idea. This is a dampening sine wave. Actually, it’s only natural that when a new technology comes on the scene, it isn’t necessarily aimed at simplification, or the developers behind it aren’t necessarily thinking about, how can I make this really easy to use? In fact, simplifying something too early can be the worst thing to do. Often, complex things require complex solutions, initially. As you probe the problem, as you figure out whether this thing works or that thing works better. As we’ll see in a moment, if you look at distributed systems growth, you’ll see that it was a lot more complex initially, and then did get simplified over the years. Part of that simplification includes standardization. Frameworks come on the scene and abstractions to make whatever it is more simple, maybe a bit more prescriptive. All this leads to simplification, and future developers therefore needing to know less about the underlying technology that they’re using, and perhaps even taking for granted.

Reality

It’s hard enough with just one damped sine wave. We all know the reality is that today’s application developer has to juggle a lot of things in their mind, because there is so much going on, as I showed earlier. I think application developers go through cycles of working on non-application development things as a result, including infrastructures. Because when they start down a path to provide a solution, they might suddenly find that there is no solution for something that gets in their way to provide the next banking app, for instance. A really bad example. Or next travel booking system. That takes them down a detour where they maybe by themselves or with others, provide a solution to that problem they run into, and then they get back on the main road. Then they finally build their application. In fact, in many cases, that developer pool, the pool of developers who have built distributed systems and provided solutions to clustering and SQL and transactions. It’s seeded by application developers, some of whom then decide to stay on in that area, and probably never get back to their application.

Where Are We Heading?

I hear a lot from developers in my org, and many of our customers that they are spending more time on non-functional aspects than they want. It worries some of them because they’re not experts in Kubernetes, or immutable architectures, or metrics, or observability, or site reliability engineering. Like, do I really need to carry a pager for my application now? For those of you in the audience who’ve been in this industry long enough, you’ll probably remember similar thoughts and discussions from years ago, maybe around REST versus WS-*, or even J2EE against Spring Framework. I also think as an industry, we often focus too much on the piece parts of the infrastructure. Like, I’ve got the best high performance messaging approach, or I’ve got the best thread pool. We do need to consider the end user, the application developer more. We should start thinking maybe a little bit early about what does this do for the application developer? Am I asking that developer to know too much about this?

Sometimes, particularly in open source, rapid feedback, which we often suggest is great for open source, or a good reason for using it, it doesn’t necessarily help as the feedback you often get in many upstream communities is from the same audience that are developing that high performance messaging. It just keeps exacerbating that problem. I think if you try and look at this from a different industry point of view, this is not what we expect from the car industry, for instance. As far as I know, you don’t hear about them having conferences purely about the engines, or their seatbelts, and how they’ve got the best seatbelts. They talk about the end result, how you make the whole thing come together and look beautiful, and perform extremely well.

Distributed Systems Archeology

Before we are able to look back to predict, hopefully, what’s going to happen in the future, it’s important that we look back I suppose. Distributed systems began in the ’70s, probably the ’60s really, where developers wanted to connect multiple computers together. They built systems that could talk between heterogeneous environments, send messages, receive messages. Between the ’70s and the ’90s, these were though, typically, single threaded applications. Your unit of concurrency was the operating system process. Many languages at the time had no concept of threading. If you were lucky, and you were programming in C, or C++, and maybe other languages had something similar, but you could do threading if you setjmp and longjmp, you could build your own thread packages. In fact, I’ve done that a couple times. I did it as an application developer. I was building something, it was a simulation package. I needed threads, there were no thread packages in C++ at the time. We built a thread package. Then went back to the application.

That’s the complexity that you had to get involved in because you came across a problem that got in your way and you had to solve it. Then I think if you look at the core services and capabilities that started to emerge back then, you can see what was being laid down then is now influencing where we are today. They were talking about transactions and messaging and storing, things we take for granted. They had their birth in the ’70s to the ’90s. We need to also remember, there’s something called the fallacies of distributed systems, or Notes on Distributed Systems by Waldo, where they showed that distributed systems shouldn’t look like local environments, local systems. In the ’70s to the ’90s, we spent a lot of time trying to abstract distributed systems to look like local environments. That was not a good thing. Go and look at that. In some ways, that’s an example of oversimplification, and frameworks trying to hide things, which they probably shouldn’t have.

CORBA (Other Architectures Were Available)

One of the first standard efforts around distributed systems was CORBA from the OMG. Here’s a high level representation where we have these core services, like transactions, and storage, and messaging, and concurrency control. Everything still in CORBA initially was still typically single threaded. COBOL, C, C++, no threads. POSIX threads was evolving at the time, and then eventually it came along and was more widely adopted. Developers still needed to know a lot more than just how to write their business logic. Yes, they didn’t need to know the low level, things like network byte ordering anymore, because CORBA was doing a really good job of isolating you from that. It was getting simpler. Abstractions were getting more solid, but you still had independent services. At least they conformed to a standard for interoperability and some portability, but you still have to think which ORB, which language?

CORBA and systems that came before it and many that even came after it, was also pushing a very closely coupled approach. What I mean by that is in CORBA, for instance, you didn’t typically define your service endpoints in the language that they were going to be implemented in. You used something called IDL, Interface Definition Language. That could then create the client and service stubs in the right language. You wrote the IDL once, and you could make that service available to end users who wanted to interact with it through C, C++, Java, COBOL. You didn’t need to worry about that, as a developer. That was a good simplification. The IDL generator would do all of that work for you. However, because the IDL is essentially the signatures of the methods you’re going to invoke, it meant that, like in any closely coupled environment, if you change the signature. You change an int to a double, you have to regenerate all the client server stubs in all of the languages that will be used, or existing clients will suddenly not be able to use that service. Essentially, you change one signature, you have to change everybody. There are some advantages. That was certainly a simplification approach over having to know host port pairs and network byte order and big-endian, little-endian from the ’70s, absolutely. It had disadvantages in that, like I said, it could be quite brittle to the distributed system.

In the late ’90s to early 2000s, we saw the chip explosion and performance of chips went exponential. You had multi-core, hyper-threads, RAM sizes exploding. It went from kilobytes to megabytes. Obviously, we’ve gone much further than that. Interestingly, network speeds were not improving as quickly. One of the things that that then pushed was the colocation of services to get out of having to pay the network penalty.

The Application Container

Colocation became an obvious approach to improving performance, and reducing your memory footprint. If you could take your multiple services that were consuming n times the memory and stick them into one service, then hopefully the memory size wouldn’t be n times m, where m is the number of services. Your application container, and I’m using that term so that we don’t get confused with Linux containers, or operating system containers, it does much of the work. It was also believed to be a more reliable approach. In this era, colocation becomes the norm. In fact, in Java, it takes only a few years, other languages took quite a while. In fact, in CORBA, if you are still involved with that back in the day, their component model was essentially a retrofit of the J2EE component model. Abstractions at this point have become a key aspect of building distributed systems. J2EE, Spring Framework all tried to hide the non-functional aspects of the development environment, to enable the developer to build more complex applications much more quickly. You abstract clustering, more or less, and you abstract transactions. Developers are typically only having to write annotations, but it’s not perfect. However, developers, I think, start to feel much more productive in this new era with these application containers than they were in the CORBA era.

J2EE Architecture

This is an example of a typical application server with the application container. Obviously, I’m using JBoss here because of my background, but there were many other implementations at the time and still quite a few today. You can see the core services, thread pool, and connection pooling, and the developer sees very little as the container is handling it all. Like I said, the developer typically annotates the POJO, says this method should be transactional, this one should have these security credentials, and the container takes care of it all, really simplified. At the tail end of that, you know that sine wave there.
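As a small, hedged illustration of the annotation-driven model Little describes, a business method in such an application container might look roughly like the sketch below; the PaymentService class and role name are invented for the example, using standard Java EE annotations.

```java
import javax.annotation.security.RolesAllowed;
import javax.ejb.Stateless;
import javax.ejb.TransactionAttribute;
import javax.ejb.TransactionAttributeType;

// Invented example: the container supplies transactions, pooling and security
// around the annotated POJO, so the method body is pure business logic.
@Stateless
public class PaymentService {

    @TransactionAttribute(TransactionAttributeType.REQUIRED)
    @RolesAllowed("payments-admin")
    public void transfer(String fromAccount, String toAccount, long amountInCents) {
        // Business logic only; the container begins and commits the transaction
        // and enforces the security constraint around this call.
    }
}
```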

Application Container Backlash

Frameworks such as Spring do a much better job of abstracting this and making the lives of developers easier than some other application container approaches. I have to include J2EE there, even though my background is much more than that, and even in CORBA. Credit where credit is due. Spring developers at the time recognized that there was still much more complex in J2EE than there needed to be. It had become to a point where it really wasn’t as simple to develop the easy stuff in some of those application containers. Because we started to add more capabilities into these things, which not every developer wanted, they became more bloated. They took up more memory footprint. They took longer to start up. There was this all or nothing approach, you had to have everything in the J2EE spec, for instance. Until the Web Profile came in towards the latter half of J2EE’s life, you had no choice. Yes, you could tune things, you could remove things in some implementations, but it was a non-standard way. It might work with JBoss, but probably it wouldn’t work with WebLogic, and vice versa. Also, at the time, there was this big push related to the bloating and the slow startup, in some regards. To trying to have long running applications and dynamic aspects pushed into the application container itself, so that you didn’t have to stop the application server if you wanted to make a change to the messaging service implementation, for instance. OSGi came along to try and help that on their rather bespoke module systems. All of this brought more bloat and started to not be a good experience for many application developers.

Services Became the Unit (SOA)

Then, services became the unit of coding again. I think a lot of that was based on problems with building systems from closely coupled approaches. Teams couldn't be independent enough: change your method signature and you had to let everybody know. I'll come back to that. People wanted to get away from application containers for some of the reasons we mentioned earlier. We saw the re-rise of services and service oriented architecture. Like I said, CORBA had this originally, and you could argue that it's similar. SOA, though, was meant to focus on loosely coupled environments, so that's where the similarity with CORBA starts to break down; CORBA is very much more closely coupled. In SOA, you have a service as a unit of work. It's not prescriptive about what happens behind the service endpoint. It's in some ways similar to an IDL, but the idea was that you didn't have to inform end users if you made a change to a signature. You could do that dynamically, or you could let them know this signature changed, maybe try this version instead.

Web services, or WS-*, came on the scene at about this time. They often got equated to service oriented architecture, but they weren't the same thing. You could write SOA applications in CORBA if you wanted to. You could write them in J2EE. You could write them in web services, but you could just as easily write non-SOA applications in these same technologies. Generally, web services should not be equated with SOA. ESBs came on the scene about the same time, and the same sort of arguments applied: whether ESBs were SOA, whether ESBs were web services, whether you should put the intelligence in the endpoints versus the pipes. The answer is the endpoints, I think. Then there's REST, which has been around since the birth of the web; there was a lot of debate at the time, but in many cases you only have to look at where we are today. REST won. The REST approach, I think, is much more conducive to SOA and loose coupling than WS-* was, and probably ESBs too.

Now of course, it was Einstein who said make everything as simple as possible, but not simpler. Perhaps the rush to simplification and improving performance led to more complication at different levels in the architecture, in this SOA architecture essentially. Despite what I said earlier about web services not being equated with SOA, they did tend to get used a lot in SOA. In some ways, as we'll see later on, maybe that gave SOA a bad rep.

Web Services Standards Overview

This is a really old screenshot of some standards from around 2006 to 2008. This is the WS-* stack. You can see that complexity had started to come back in here. There was a lot of effort put into individual capabilities in WS-*, the enterprise capabilities that various vendors and consultants and individuals said were needed for building enterprise applications. It doesn't matter whether you're building them with J2EE or with SOAP, you need things like messaging, and eventing, and all these things. My concern, looking back now, is that not enough thought was placed on how this plethora of options and implementations would affect the application developer. There was a lot of backlash as a result. Frameworks and abstractions didn't manage to keep up initially. Developers were forced to hand code a lot of this, and that led to the inevitable demise, I think.

Monoliths Are Bad?

At that turning point for SOA and web services, we started to hear about how monoliths are bad. I don't buy into this. Like pretty much everything, there's no black or white, there's no silver bullet. If you've got a monolith, and your monolith has been working well for you for years or decades, it doesn't matter if it's written in COBOL. If it's working, don't refactor it, just keep using it. There's nothing wrong with a good monolith. Not all monoliths are bad. That doesn't mean you should never refactor a monolith, but don't think that because you've got a monolith you've necessarily got something bad that you have to put engineering effort towards rewriting in the latest cool language or the latest cool framework. I would absolutely recommend, if you haven't seen it, go and look at The Majestic Monolith. It's a very pragmatic approach. It's a relatively old article now, but it's still online, and you should definitely go and look at it. It says things a lot better than I was able to.

Microservices

At that point, when monoliths were said to be bad, we suddenly saw this discussion about microservices, which are still around today. Hopefully everybody knows that. It's almost a decade ago that Adrian Cockcroft and others started to talk about pragmatic SOA, and that became microservices. We sometimes forget about the heritage, but it's there: they are related to SOA. Despite the fact that many people at the time were trying to point this out, unfortunately, I think we failed to learn quickly enough from history in this case, that this approach in many ways was going to be turning back the clock and adding some complexity. There are absolutely benefits, but there was a lot of complexity that came in very early on that I think perhaps we could have avoided.

I think the reasons for this complexity are fairly straightforward. For many years, as an industry and even in academia, we'd been pushing developers to be less interested in knowing they were building on distributed systems, with coarse-grained services and three-tier architectures. Our tools, our processes, even education were optimized for this almost localized development. Yes, there were distributed systems, but there weren't many components, two or three. Then, suddenly, with microservices, we're turning back the clock to the '70s. We're telling developers that they need to know about the distribution. They need to understand the fallacies of distributed systems. They need to understand CAP, and monitoring, and fine-grained services. I'm not saying they don't need to understand those things. I think that they don't necessarily need to understand all of them, and certainly not in as much depth as we were asking them to almost a decade ago.

Linux Containers

There’s definitely been one great simplification in my view in the previous years for application developers, and that’s operating system containers. Yes, I know the concept predates Linux with roots in the ’60s and ’70s, but for the purposes of our talk, we’ll focus on Linux. Linux containers came on the scene around the time of microservices, so some people start to tie them together. The reality is that they can and are used independently. You don’t have to use containers for microservices. Likewise, your containers don’t have to be microservices. Unless you’re dealing with actual services, which tended to be on a static URL, for instance, and you only needed to worry about how your clients would connect to them. Distributing your applications has always been a bit of a hit and miss affair. Statically linked binaries come pretty close, assuming you just want portability within an operating system. Even Java, which touted originally the build once, run anywhere, that fails to make this a reality as well. If you’re a Java user, hopefully you’ll understand what I say. Just look at Maven hell, for instance.

Kubernetes

Clearly, one container is never enough, especially when you're building a distributed system. How many do you need to achieve specific reliability and availability metrics? Where do you place those containers? What about network congestion, or load balancing, or failover? It is hard. Kubernetes came on the scene as a Linux container orchestration approach from Google. There were other approaches at the time and some of them are still around, like Apache Mesos or Docker Swarm, but Kube has pretty much won. Developers do need to know a bit about Kubernetes, and specifically immutability, but they shouldn't have to be experts in it. I think we're still at that cusp where Kube is too complex in some areas and for some developers. The analogy I use: you probably know and use a VPN, but you don't necessarily know how it works. You don't need to. Back to immutability. Immutability is the biggest change that Kubernetes imposes. It assumes, because of the way it works, that when it wants to fire up a container image somewhere, it can just pull that container image out of a repository, place it on a machine, and start it up. If that's a replica image, it might take down another image, or that other image might have failed. It doesn't need to worry about state transfer; it's all immutable. If you want to change that image, like the example we were using earlier with application containers where you want to change the messaging implementation that's in that container, you create a new image. You put that new image in the Kube repository. You do not change the image as it is running, because if you do and Kube takes it down, you've lost that change.

Why is this important? Because for almost two decades, especially in the Java community, we've been telling application developers and application container developers to focus on making dynamic changes to running applications, whether through OSGi or equivalents, as we discussed earlier. Leaving runtime decisions to the last minute, or even the last second, is the approach we've asked developers to take. Yet immutability goes against this. Decisions like that should be made at build time, not runtime. We're only now starting to understand the impact that this is having on application developers, and some frameworks are starting to catch up, but not all of them.

Enterprise Capabilities

What about enterprise capabilities? We've seen this with web services, and with J2EE and others. We've of course been focusing a lot more on containers, and trying to simplify that. What goes around those containers and between them? Things like messaging and transactions. Applications still need them; it doesn't matter whether they're deployed into the cloud, public or private. You probably still need to have some consistency between your data. You want to be able to invoke those applications very fast. We're actually seeing application containers, which if you remember in many ways evolved from CORBA-like approaches and from colocating those services into a single address space, now breaking apart. Those monoliths are becoming microservices, so you're having independently consumable containers and services. Those services are going to be available in different languages, and it's not just REST over HTTP; there are other protocols you can invoke them through, such as JMS or Kafka. There's lots of work going on in this space. We're going back a bit: they're no longer residing in the same address space. With that come the problems that we perhaps saw with CORBA, where the network becomes the bottleneck. What does that mean from a performance point of view?

The inevitable consequence of this concern about what goes around the containers, how you glue them together, and how you provide consistency, is complexity. Whether you bought into web services or not, and that complex diagram we saw earlier, you do need these things. Perhaps web services could have provided them in a different way, maybe through REST, for instance. There were different approaches that REST [inaudible 00:35:09] had at the time that were perhaps simpler. You do need these capabilities, or at least many of them. I want to show you this. This is the CNCF's webpage. The CNCF is a group that is attempting to standardize and ratify a number of different approaches to building in cloud environments. Various projects are there, like Knative and Kubernetes. They want to be the place that helps application developers choose the right tool for the right job. The thing to know here is that a lot of these smaller icons are not necessarily competing specifications for one of the [inaudible 00:35:59], they are actually different implementations of the same thing, so different CI/CD pipelines, for instance. That is still complex. Which group do you choose? Which pipeline do you choose? Does that pipeline work with the group you've just chosen? There's still a lot of complexity in this environment, and it's evolving.

There Are Areas Where Complexity Remains

There are areas where complexity remains. I think that's because we haven't quite figured out the solution domain yet. It's a little early to abstract it away for application developers. I think we're doing the right thing in a number of areas: we're innovating, and trying to innovate rapidly, rather than standardizing too early and providing frameworks that abstract too early. We need real world feedback to get these things right. Unfortunately, that complexity is there. It has to be there until that sine wave decays, and we get to the point where we feel we understand the problem and we have good enough solutions. I think application developers have been at the forefront of distributed systems development for decades. It's inevitable. It's how this whole industry evolves. If we didn't have people to build applications, then we wouldn't have Kubernetes, and we probably wouldn't have Android or other equivalents.

Areas Still To Be Figured Out for the Application Developer – Consensus

I did want to spend the last few minutes on my thoughts on areas that are still to be figured out, areas where there still is complexity, where we probably have a few more years to go before we can get to that simplification. They include consensus in an asynchronous environment, large scale replication, and accountability. I think complexity will unfortunately continue in these areas, or fortunately, if you are of the mindset to want to get involved and help fix things. Consensus, the ability for participants in an application or in a group to agree on what the outcome of some unit of work is, is incredibly important in a distributed environment. Two-phase commit in transactions, for instance, is a consensus protocol. Getting consistency with multiple participants in the same transaction is very important.

We have a growing need, though, in the world that we now live in, whether it's cloud, or cloud and edge, or just edge, to have consensus in loosely coupled environments. Two-phase commit is a synchronous protocol, very suited to closely coupled systems like local area networks, for instance. As we grow our applications into the cloud, and the cloud gets larger, and you have multi-clouds, it's inevitable that your applications become much more loosely coupled. You still want that consistency in your data, in your conversations. Multi-party interactions are becoming more conversation-like, with gossip protocols, where I tell you, you tell somebody else; I don't tell somebody else directly. Then, at some point, we come together and decide that was the answer, not that but this. How do we do that, though, in an environment where we have failures or slow networks? Like I said, it's an active area of research and development, but I hope this is an area where we can actually learn from the past. You don't have to go back that far: less than a decade, and you can find lots of work from RosettaNet and ebXML where they've done this. No central coordinators, lots of autonomy, and yet with guarantees.

Replication at Scale

Then there is replication at scale. I've mentioned transactions a few times, and there are also extended transactions, where you loosen the ACID capabilities. Consistency is important for many applications, not just in the cloud but outside it too, and that concern predates the cloud by a long time. Relaxing ACID properties may help; there's been a lot of work going on in that over the last 20-odd years. We have to explore the tradeoffs between strong and weak consistency. Strong consistency is what application developers intimately understand. It's really what they want. I haven't spoken about NoSQL so far, but one of the things about NoSQL is that it typically goes hand in hand with weak consistency. While there are use cases where that is really important, it's harder for most developers to think about and to build their applications around. Frameworks have often not been that good at helping them understand the implications of weak consistency.

You only have to look at Google, for instance. They did an about-face: they were going down the weak consistency avenue for a long time, and then they came out with Spanner. Spanner offers strongly consistent, distributed transactions with ACID guarantees at global scale. Not just a few machines running in the same city, or a few machines running on the same continent; they're trying to do this globally, and they're doing it successfully. CockroachDB is essentially an open source equivalent of Spanner, and there's FoundationDB too. Many efforts now are pushing us back towards strong consistency. I don't think it'll be one or the other. I think we do need to see this coming together of weak consistency and strong consistency, and caching plays into this as well. Frameworks need to evolve to make this simpler for application developers, because you will be building applications where you have different levels of consistency, particularly as we grow more modular in nature.

Accountability

Finally, accountability. The cloud does raise issues of trust and legal implications: third parties doing things that you didn't want them to do. What happens if something goes wrong? Who was responsible for making that change in the ledger? In the absence of solid evidence, disputes may be impossible to settle. Fortunately, we're seeing things like blockchain come on the scene, because accountability is fundamental to developing trust, and audit trails are important. In fact, I think we should have the equivalent of a flight recorder for the cloud, one that records logs of service interactions, events, and state changes, guarantees fairness and non-repudiation, and enables tracing back of incidents. If you've ever seen a hardware TPM, or Trusted Platform Module, you'll understand where I'm coming from. That's a chip that's usually soldered onto a motherboard, and you can't get rid of it without breaking the motherboard. It tracks what comes in and out of your machine in a way that can be used in a court of law. I think that's the sort of thing we will see more of in the cloud.

Conclusions (Predictions)

I actually think Kubernetes will eventually disappear into the background. I'm not saying it's going to go away. I think it is important, and I think it's going to remain the standard until something better comes along. Even if that does happen, my expectation is that it will sink into the infrastructure, and you as an application developer won't need to know about it. Just as, over the last decade or so, you've hardly ever had to know about Java EE application clustering: you fire up another WildFly instance, and it just clusters automatically. You don't need to configure anything. Compare that with what you had to do with JBoss AS back around 2004; it's chalk and cheese. Linux containers will be the unit of fault tolerance and replication; I think we're pretty much there anyway. Application containers, or something like them, will return. By that I mean we'll start to see these disparate services get colocated in the same address space, or something that brings them closer such that you don't have to pay the network overhead as much. That could be with caching, for instance, or it could be with lighter weight implementations in some use cases.

Event driven will continue to grow in adoption. I do think frameworks need to improve here and to abstract a bit more. It's definitely a different mindset to think about asynchronous, event driven applications; it's not the same as your synchronous model, for instance. My only product or project reference here would be Quarkus. If you haven't looked at Quarkus, please do. The team is trying to make event driven very simple. It uses Vert.x under the covers and tries to hide some of that; if you want to know what Vert.x is doing there, you can get down into the details, but generally they're trying to make it simpler for application developers. I think immutability from Kubernetes is actually a pretty good approach. It's a pretty good architectural style, and I think it's probably going to grow because of Kubernetes. I think generally it's also useful to consider as a mindset. The application developer will continue to help evolve distributed systems. It's happened for the last 50 years; I don't see it changing for the next 50 years.



Entity Framework 7 Brings Bulk Operations and JSON Columns

MMS Founder
MMS Edin Kapic

Article originally posted on InfoQ. Visit InfoQ

Version 7 of Entity Framework (EF) Core, Microsoft's object-to-database mapper library for .NET, was released in November, together with the wider .NET 7 release. The updated version brings performance improvements when saving data, allows JSON column operations, enables efficient bulk operations, and contains many minor fixes and improvements. EF Core 7 is available for both .NET 6 and .NET 7.

Microsoft released EF7 on November 8th, distributed as a set of NuGet packages. According to the breaking changes documentation, the most important change in EF Core 7 is that SQL Server connections are treated as encrypted by default. Developers will have to configure a valid certificate on their machines or explicitly relax the security restriction; if not, connection strings that were valid in EF Core 6 will throw an exception in EF7.

One of the already-known advances in EF Core 7 is the performance improvement when saving changes to the database with the SaveChangesAsync method. In some scenarios it can be over 50% faster than EF Core 6 on the same machine.

EF Core 7 adds support for mapping text columns containing JSON documents in the database into queryable objects. Developers can filter and sort on the JSON properties inside the documents as part of the query to the database. EF7 contains generic support for JSON columns together with a concrete implementation for the SQL Server provider.

Bulk operations on the database, such as bulk updates or deletes, have also been reworked in EF7. A standard SaveChangesAsync call can affect multiple records, but the affected entities have to be tracked in memory. EF7 adds two new methods, ExecuteUpdateAsync and ExecuteDeleteAsync, which perform bulk operations on the server immediately and don't load any entities into memory.

By default, EF Core maps an inheritance hierarchy of .NET types to a single database table, in a strategy called Table-per-Hierarchy (TPH). EF Core 5 added the Table-per-Type (TPT) strategy, where each type in the hierarchy gets its own database table. EF Core 7 now adds the Table-per-Concrete-Type (TPC) strategy, where each non-abstract type gets a database table and the abstract type's columns are added to the tables of its concrete implementations.

There are other improvements in EF7 such as support for custom T4 templates in database-first reverse engineering, support for overriding and changing default model conventions, improved interceptors and events, and mapping of inserts, updates and deletes to stored procedures.

While .NET developers have historically perceived Entity Framework as bulky and full of shortcomings, the new versions are now recognized as a highly efficient, fault-tolerant ORM.

With the EF7 release, there is already a roadmap for EF8 with more JSON column enhancements, support for .NET value objects, and the ability to return unmapped types as query results.



Eclipse Migration Toolkit for Java (EMT4J) Simplifies Upgrading Java Applications

MMS Founder
MMS Johan Janssen

Article originally posted on InfoQ. Visit InfoQ

Adoptium has introduced the Eclipse Migration Toolkit for Java (EMT4J), an open source Eclipse project capable of analyzing and upgrading applications from Java 8 to Java 11 and from Java 11 to Java 17. EMT4J will support upgrading to future LTS versions.

Organizations are advised to keep their Java runtime up to date to get security and functional improvements. Meanwhile, new Long Term Support (LTS) Java versions will be released every two years, and projects such as Spring Framework 6 now require Java 17. Unfortunately, adoption of new Java versions is relatively slow. For example, in 2022, four years after its release, Java 11 was used by less than 49 percent of Java applications.

Upgrading an application to a new Java version means developers need to resolve all the issues introduced by the changes and removals inside Java. This includes functionality such as the removal of Nashorn and the J2EE packages, changes in APIs, and access to Java internals that has become more restricted.

EMT4J offers a Maven plugin (not yet available on Maven Central), a Java agent, and a command-line solution to analyze project incompatibilities with new Java versions; the output is written in TXT, JSON, or HTML format.

To demonstrate EMT4J, consider the following example application, which calls the Thread.stop() method that EMT4J flags as a removed API:

Thread thread = new Thread();
thread.stop();

After cloning the Git repository and configuring the Maven toolchains for JDK 8 and JDK 11, the project may be built with:

mvn clean package -Prelease

This results in a .zip file in the emt4j-assembly/target directory which may be extracted. Inside the extracted directory the analysis can be started. For example, from the command line:

java -cp "lib/analysis/*" org.eclipse.emt4j.analysis.AnalysisMain -f 8 -t 17 
    -o java8to17.html /home/user/application/classes

This analyzes the class files in the specified directory and displays potential issues when upgrading from Java 8 to Java 17 in the java8to17.html file. Alternatively the .bat or .sh scripts in the bin directory of the extracted archive may be used to start the command line analysis. The README file describes all available options for analyzing classes and JAR files.

The resulting HTML file displays the description, resolution and issue location:

1.1 Removed API Back to Content
1.1.1 Description
Many of these APIs were deprecated in previous releases and 
    have been replaced by newer APIs.
1.1.2 How to fix
See corresponding JavaDoc.
1.1.3 Issues Context
Location: file:/home/user/application/classes/App.class, 
    Target: java.lang.Thread.stop()V

Alternatively, the EMT4J agent may be used when starting a Java application, or the Maven plugin when building the project.

The project contains rulesets for upgrading from Java 8 to 11 and from Java 11 to 17. For example, the JDK internal API rule is used to verify if an application uses the JDK internals; its definition looks roughly like this:

<rule type="reference-class" class-package-file="jdk_internals.cfg" result-code="JDK_INTERNAL">
    <support-modes>
        <mode>agent</mode>
        <mode>class</mode>
    </support-modes>
</rule>

The support-modes indicate whether the rule can be used in agent mode and/or via static analysis (class mode) with the command line or the Maven plugin. Translation resource bundles are linked via the result-code, in this case JDK_INTERNAL, which maps to the JDK_INTERNAL.properties and JDK_INTERNAL_zh.properties translation files inside the emt4j-common/src/main/resources/default/i18n directory.

EMT4J scans the application for packages and classes, such as sun.nio and sun.reflect, defined in the class-package-file jdk_internals.cfg in the emt4j-common/src/main/resources/default/rule/8to11/data/ directory.

The actual rule type, reference-class, is implemented inside the emt4j-common/src/main/java/org/eclipse/emt4j/common/rule/impl directory, as the JDK internals rule has the support-modes agent and class.

@RuleImpl(type = "reference-class")
public class ReferenceClassRule extends ExecutableRule {

The existing rules may offer inspiration for adding custom rules by following the instructions in the README file.



Presentation: How Starling Built Their Own Card Processor

MMS Founder
MMS Rob Donovan Ioana Creanga

Article originally posted on InfoQ. Visit InfoQ

Transcript

Donovan: My name is Robert Donovan. I’m the tech lead for the card engineering team at Starling Bank. I’ve been at Starling, coming up to three years now.

Creanga: I’m Ioana Creanga. I’m an engineering lead for payments processing group at Starling. I’ve been with Starling for almost four years now. This is the presentation that we have on how we’ve built our card processor. It’s going to mainly focus on the architecture of it, and we hope that you’re going to see some design patterns in it, and decide to use some of them in your own architectures later on.

Outline

At this point, if you don’t live in the UK or you don’t work in the financial services domain, you might be having two questions. One is, what is Starling Bank? The second one is, what exactly is a card processor? For us to be able to give insight on how we built the card processor, we’re going to have to discuss about the more global payment architecture and how does the card processor fit into that more global payments architecture. You’re going to also find out what happens when you pay with your card at a terminal or at an online shop. We’re going to reveal our architecture diagram of the card processor, because this is why we’re all here. On the way, we’re going to discuss really interesting concepts like HSMs, MIPs, how PIN verification works, and what cryptograms are. For now, let’s just find out what is Starling Bank.

What is Starling Bank?

Donovan: Starling is a digital bank. You can see our lovely app here. We’ve won best British bank for four years in a row. We’re going to be talking about our cards that you can see here. Just to give some perspective on the scale of the bank, we currently have 2.8 million accounts, with £8.4 billion worth of deposits in them. From a card volume perspective, we currently process just over 2000 authorizations per minute on average, which is around 33 per second. As you can imagine, our numbers are going up very quickly every day. We need to be able to scale upwards quickly.

What Is a Card Processor?

Creanga: This slide promises to answer this question: what is a card processor? For us to tell you that, we're going to go through the more global payments architecture and where the card processor fits into that architecture. We're going to be discussing all of these parties that are involved when you pay with your card at a terminal. After we explain all of these systems, we're going to give a very simple example of a card authorization message. Let's start with the merchant. The merchant is the company that you buy from, so this is Amazon or Pret a Manger. For these merchants to be able to accept card payments from a card association like MasterCard or Visa, they need to have an account with an acquiring bank. The acquiring bank is simply the bank that holds the money for the merchant, and it doesn't know anything about the customer that is trying to make the purchase. To get that answered, it will have to put this transaction message into the payments network. We have MasterCard there, just because Starling supports the MasterCard card network, but it could be Visa or American Express. The payments network's role is to look inside a card transaction message and see what the primary account number for it is, which is what we call a PAN, which is simply the number that is printed on the card. It needs to look inside, see this PAN, and know where to route this transaction, so which card processor it needs to go to, based on that information.

Another thing that the payments network needs to be able to facilitate is the movement of funds between the issuing bank and the acquiring bank. When you pay with your card, obviously, money needs to leave your account and end up in the merchant's account. This movement of funds needs to be facilitated by the payments network. Now we've reached the card processor. This is what we're talking about; this is what this presentation is about. We're going to spend a bit more time here. You notice here that the card processor and the issuing bank are two separate entities on this slide. The reason for that is that some banks don't have their own card processor. They have this outsourced to a third party company. Starling used to be the same. Seven years ago, when Starling Bank launched, we didn't want to embark on this long journey of creating a card processor, so we decided to just go to a third party company that does that for us. Starling Bank now actually has two card processors, because we're still using the old card processor with this third party company, and we also have the new card processor that we've built and that we're talking about now; half of our traffic goes through the card processor we've built, and half of it still goes through the old one. That explains why these two entities are separated on this slide. We're just going to be talking about the card processor that we've built.

What is the main role of a card processor? It needs to be able to have access to the payments network. As you'll see later on, this is not a trivial task. It also needs to be able to decode these messages that are coming from the payments network and verify cryptographic transaction data. The simplest examples of these cryptographic operations would be the PIN verification and the cryptogram validation, which is just a way of figuring out if the transaction has been tampered with on the way. To be able to perform all these operations, it needs to hold and manage some secrets of the card. It needs to be able to store things like the PIN block, the CVV, and the expiry date. Another thing it can do, although it's not always the case, is hold the account balance. For example, for the old card processor that goes through the third party company, they don't hold the account balances for us; they just make a call to us to understand if the customer has enough money in their account or not. Of course, for the new card processor, we do hold our own account balances. These are the main things that the card processor needs to be able to do. At the issuing bank level, business rules can be applied to a card transaction, things like, does this customer have enough money, and a lot of other fraud checks will be performed at the issuing bank level.

Now that we've gone through all of these parties involved in a card authorization message, let's just go through a very simple example of a card authorization. Rob here is not only an excellent coder, but he is also a great musician. His favorite merchant is Denmark Street. He goes to Denmark Street and he wants to buy himself yet another very expensive guitar. Because it's a very expensive guitar, he can't just tap the terminal; he has to put his card in and enter his PIN. This is what we call a chip and PIN transaction: chip because the card has a chip, and PIN because he has to enter his PIN. The transaction then gets packaged up by the terminal and sent over to the acquiring bank's system, so we have Barclays there. We don't know if Barclays is actually the acquiring bank for Denmark Street. Barclays doesn't know anything about Rob, because Rob doesn't have an account with them, so it won't actually put any money into Denmark Street's account yet. It will mark this card transaction as being in flight for Denmark Street, and then send it into the payments network.

The network in this case is MasterCard, because Rob has a MasterCard card. What MasterCard will do is look inside the transaction and see the card number that is printed on the card. It will figure out that this transaction needs to be routed to the Starling card processor that we've built. It reaches the card processor, and here we're going to do all of the verifications that we need to do. We're going to verify that Rob remembered his PIN correctly. We're going to figure out if the transaction has been tampered with on the way. We're also going to verify business rules: does Rob have enough money in his account? Or maybe in the past, to remove any temptation to buy more guitars from Denmark Street, Rob decided to just block this merchant; we're going to verify that here as well. We're also going to do some fraud checks: is this transaction likely to be fraudulent or not? This is what happens with a card transaction message; this is how it flows through all of these systems that are involved. The question is, if building a card processor is such a complex job that entire companies build their business on it, why did we even bother to go on this long journey of building our own card processor?

Why Build Our Own Card Processor?

Donovan: There are a few reasons why we decided to embark on this project. The first of these is innovation. Starling has been innovating in the banking space since it started. Previously, there were certain restrictions on us being able to add new features that might be part of the card processing workflow. We really wanted to bring this functionality in-house so we could create new products where we're in control of them from the moment they leave the MasterCard network. Alongside that is time to market. By bringing this into the Starling ecosystem, it means that we can release our card processor as quickly as we can our other services, and we do many releases a day. Whereas if you use a third party, they obviously have to prioritize work depending on the needs of all of their various customers. We can build new features very quickly and release them based on our own priorities. If bugs come up, then we can also fix those very quickly. That brings us on to resiliency. When issues do come up with a third party, they have to manage that incident across all of their customers. We can very quickly identify the problem and release fixes, or even mitigations, on our side. Also, there's the fact that MasterCard only give the issuing bank 7 seconds to respond to a card authorization. If you take longer than that, then MasterCard will make a decision on our behalf depending on the rules that we have set up with them. Bringing this in-house gives us the full 7 seconds to respond, whereas using a third party you generally get a much more restricted amount of time to execute your business rules. Finally, there's the obvious one, which is a long term cost saving. By having this in-house, it means we don't have to pay contract costs, and the interchange fees that MasterCard give us for each transaction we can keep in full; we don't have to give a percentage to a third party.

What Are The Challenges?

What are the technical challenges with this? The first one is that we're going to need some boxes sitting in a data center, firstly to connect to the MasterCard network, and also for our security purposes. We have these physical boxes, but we need them to be able to talk to our microservices architecture that's running in the cloud, which is where we have all of our very fast release cycles and resiliency, and all of that stuff. Next is security: we're now going to start storing sensitive data from our customers that we didn't previously need to store. Previously, it was the third party's challenge to store that data securely. There's a huge amount of complexity in building a card processor. This is because of the sheer number of permutations that you can get in terms of the types of messages that you have, and how you have to handle those messages depending on certain jurisdictions and regulations in different parts of the world. You have mobile wallets, online transactions with 3D Secure, as well as terminal based transactions and all of these sorts of things. Finally, we want to make it so that if customers do encounter issues with card payments, we can very quickly identify where those issues are occurring and understand them quickly, so that we can fix them if necessary, or at least communicate to the customer what's happened.

Starling Tech Stack

What’s the Starling technology stack? We have things running in the cloud. We currently run everything on AWS. We were using containers and EC2 instances, but we’re moving stuff over to Kubernetes now. We use infrastructure as code in the form of Terraform, for example. Our infrastructure team has done a brilliant job of this and really put the power into the hands of developers for releasing infrastructure changes. We can now make infrastructure changes as quickly as we can, and even standard service changes, which is fantastic. Our middle tier is pretty much all Java, backed by Postgres databases wherever that’s needed. Communication is generally RESTful, but we have some stuff that’s also just protobuf with Netty usually for performance reasons. We have a lot of this in our card processor.

Finally, the big mantra at Starling for our technology stack is keeping things simple. This is not a complicated, long list of hundreds of different technologies. We're very careful about introducing new pieces of technology into the ecosystem. Whenever you do that, you have to spread the knowledge throughout the engineering team, such that if you do have issues, then anyone is able to solve those problems. Also, it often just introduces new points of failure, new things that can go wrong. This stack has worked perfectly well for us for years. Resiliency is often achieved simply through ensuring that these RESTful calls, or RPC calls, are idempotent. That means that if a service becomes unavailable, the upstream service can just synchronously retry as many times as it wants, and not worry about any nefarious effects of a call being made twice.
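
As a rough sketch of that idempotent-retry pattern (the endpoint URL and the Idempotency-Key header below are invented for illustration, not Starling's actual API), the caller fixes an idempotency key up front so that repeating the same call is harmless:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.UUID;

public class IdempotentCaller {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Retries the same request synchronously; because the idempotency key is
    // fixed up front, the downstream service can deduplicate repeated calls.
    static HttpResponse<String> postWithRetry(String url, String body, int maxAttempts)
            throws InterruptedException {
        String idempotencyKey = UUID.randomUUID().toString();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(2))
                .header("Content-Type", "application/json")
                .header("Idempotency-Key", idempotencyKey) // hypothetical header name
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        for (int attempt = 1; ; attempt++) {
            try {
                return CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw new IllegalStateException("gave up after " + maxAttempts + " attempts", e);
                }
                Thread.sleep(100L * attempt); // simple backoff before retrying the same call
            }
        }
    }
}

Because the key does not change between attempts, the downstream service can recognize and deduplicate a repeated request.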

The Card Processor Architecture

Creanga: This is the architecture diagram of the card processor. We're going to divide it into three main parts and talk about them individually. What we have on the left is the Starling data center. This is where we keep all of the hardware equipment that we absolutely needed to have to be able to build the card processor. To the right of the Starling data centers in this diagram, there are two VPCs; this is what we have in the cloud. The one in the middle is called the zero trust VPC. This is where we perform all of our cryptographic operations; things like the PIN verification or the cryptogram validation all happen here. The reason why we call it zero trust, and it might be a different concept for other people, is that there's zero trust between the microservices that live in here. For example, for two microservices to be able to communicate with each other, we need to be very explicit and write infrastructure code to say that this communication is allowed.

As a tiny implementation detail on how we've achieved that, we use an AWS feature called NACLs, Network Access Control Lists. For example, if we were to pick two microservices in here, like the MasterCard service and tokenization: the MasterCard service is able to initiate requests to and get responses from the tokenization service, but the tokenization service cannot initiate requests to the MasterCard service; it can only respond to it. Every microservice in here has its own subnet, and we use NACLs to control the inbound and outbound traffic. The zero trust VPC here is not responsible for making any decisions about what should happen with the card transaction, whether it should be approved or not. Its only role is to gather all of this data, to learn more about this card transaction that is currently in flight, and send it over to the core VPC, where we have the card processor, to be able to make that decision. The role of the card processor that lives in the core VPC is to apply all of these business rules that we've talked about. Does Rob have enough money for another expensive guitar? Is this transaction likely to be fraudulent? All of these business rules will be applied here in the core VPC.

Now let’s get back to what we have in the data centers. We have two different devices in the data centers, and one is called a MIP, MasterCard Interface Processor. Where you see the MasterCard logo, what that’s supposed to represent is a device called MIP. This is what we have from MasterCard, to be able to have access to their payments network. What normally these devices have are two Ethernet ports. One is to give access, to get access to the payments network. The issuing bank normally does not have access to that. We do not have access to that one. Another Ethernet port, we just plug in into the card processor itself. These are the devices that we get all of these real-time card authorization messages. This is where these are flying from.

Another device type that we have in the data centers is the HSM, the Hardware Security Module. People that work in financial services are already very familiar with these. We use them in many different ways at Starling, but in the context of the card processor, what we use them for is to get help with cryptographic operations like the PIN block verification or the cryptogram validation. HSMs can be very generic, but the ones that we have for the card processor are very specific. They're so specific that you can literally send a command to the HSM to verify the PIN block, with the right arguments, and the HSM will just verify it without you having to decrypt the PIN block and look at the PIN in the clear. This all happens inside the device itself. Another important feature of the HSMs is that they safely store the encryption keys. They have all of these different sensors on them so that if they detect that they're being moved from the rack, they will self-destruct and wipe the encryption keys. That way you can be certain that your encryption keys are very safely stored.

That's it for the Starling data centers. We obviously have multiple data centers across the UK, and multiple MIP devices and multiple HSMs, for redundancy and resiliency. Because we've mentioned these commands that we can send to the HSMs, let's get into a little bit more detail about the PIN verification and cryptogram validation commands that we've talked about.

PIN Blocks

Donovan: When we talk about PINs, you normally think about just the four numbers that you put in, or sometimes more if you're in other parts of the world. When we send PINs around in the clear, they're actually in this PIN block format. The reason for the PIN block is to cryptographically tie the PIN itself to the customer's account number that's on the front of the card. The reason for doing that is that, when you're using a symmetric key for encrypting the PIN, you don't want two different customers who have the same PIN to produce the same ciphertext; if you just encrypted the PINs, you could see that they're the same. To avoid that, we have PIN blocks. There are a few different formats of PIN block. The one that I'm going to talk about is the one that we use, which is the ISO-0 format. It's a very simple algorithm. We start off with this here: the length of the PIN, followed by the PIN itself, and then some padding of 15 (F), all interpreted as hexadecimal. Then we take the customer's account number, take the 12 right-most digits of that, and left pad them with zeros. The last digit of the PAN, which is a 1 here, is known as the Luhn check digit. This is actually based on an algorithm to validate that the PAN has been entered correctly. It's not really for security reasons; historically, it was just to make sure that the customer hadn't made a mistake entering their card number. That's an interesting little fact. To create the PIN block, we just take these two elements and XOR them. When we talk about the encrypted PIN and passing the encrypted PIN around, we're actually talking about this PIN block.
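
A minimal sketch of that ISO-0 construction in Java, assuming (as the ISO-0 definition does) that the 12 right-most PAN digits exclude the trailing Luhn check digit; the PIN and PAN in the main method are made-up example values:

import java.math.BigInteger;

public class Iso0PinBlock {

    // Builds a clear ISO-0 PIN block as a 16-character hex string.
    static String pinBlock(String pin, String pan) {
        // Field 1: "0" + PIN length + PIN digits, right-padded with 'F' to 16 hex chars.
        StringBuilder pinField = new StringBuilder("0").append(pin.length()).append(pin);
        while (pinField.length() < 16) {
            pinField.append('F');
        }
        // Field 2: "0000" + the 12 right-most PAN digits, excluding the trailing check digit.
        String panField = "0000" + pan.substring(pan.length() - 13, pan.length() - 1);
        // XOR the two fields to tie the PIN to the account number.
        BigInteger xored = new BigInteger(pinField.toString(), 16)
                .xor(new BigInteger(panField, 16));
        return String.format("%016X", xored);
    }

    public static void main(String[] args) {
        // Hypothetical PIN and PAN, for illustration only.
        System.out.println(pinBlock("1234", "5434123456789011"));
    }
}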

The Chip and the Terminal

How does the chip actually talk to the terminal? When your card comes into contact with the terminal, or within a certain distance if it's contactless, the terminal induces a current on the chip. The chip has a very small operating system that runs on it. That operating system supports a bunch of applications, which are effectively functions: they take some bits as arguments and spit some bits back out. The terminal can select one of these applications to run and provide information based on what the customer has entered, for example. It also needs to agree a verification method with the chip. They will effectively compare what verification methods they support; there's a list of verification methods that we've programmed onto our chip in order of preference. An example of verifying the card holder would be for them to enter a PIN. Then the terminal is going to request a decision and usually send that authorization through to the acquiring bank, and eventually on to us through MasterCard.

Offline PIN Validation

There are actually two different versions of your PIN. There's the one that's stored on your card itself; there's a secure element on your card, similar to a very tiny HSM. Then there's the online PIN. We'll start off with the offline PIN verification. This actually happens in quite a lot of cases, because the terminal will only go online based on certain rules that we've set on the cards. That might be because the transaction is above a certain value, or we've deemed it to be high risk because it's cash related, for example withdrawing cash at an ATM. A lot of the time it will just validate the PIN offline. That means that it's going to start by getting a public key from the card. It then takes the PIN that the customer has entered into the terminal and creates a PIN block from it using the account number. It encrypts that using the public key that it got from the card, and then it issues a verify command, which is where the card uses the private key stored on its secure element to check it against the PIN it holds.

Online PIN Validation

Online PIN validation is when we actually receive the PIN block ourselves and we validate it against the PIN block we've got stored for that customer. Really, this process just involves passing the PIN block between different parties, encrypted with different keys. We start off with the terminal, which encrypts the PIN block under a key that is owned by the acquiring bank. The acquiring bank receives that and then translates it under a different key, a key that the acquiring bank will have exchanged with MasterCard. MasterCard receives the PIN block encrypted under the acquirer's key, and we've also done an exchange with MasterCard with our own issuer key. MasterCard then translates the PIN block to be encrypted under that key, which is how we receive it. We also have the customer's PIN block stored under a separate key. We can take these two things, the PIN block that we've got stored encrypted under key B and the one that we got from MasterCard, send them into the HSM, and ask the HSM to tell us whether they match or not. The HSM does its decryption and comparison all inside its secure housing, so we never have to see the PIN in the clear anywhere.
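
The shape of that call can be pictured with a hypothetical interface like the one below. Real HSMs expose vendor-specific commands, so the names and parameters here are purely illustrative; the point is that two encrypted PIN blocks go in and only a match/no-match answer comes back:

// Hypothetical interface: both PIN blocks stay encrypted end to end, and the
// decryption and comparison happen inside the HSM's secure housing.
public interface PinVerifier {

    /**
     * @param pinBlockUnderIssuerKey  the PIN block received from the network,
     *                                encrypted under the key exchanged with MasterCard
     * @param storedPinBlockUnderKeyB the customer's stored PIN block, encrypted under key B
     * @param accountNumber           the PAN used to bind the PIN block to the account
     * @return true if the HSM reports that the two PIN blocks match
     */
    boolean pinBlocksMatch(byte[] pinBlockUnderIssuerKey,
                           byte[] storedPinBlockUnderKeyB,
                           String accountNumber);
}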

Application Cryptogram

The application cryptogram is an element of chip transactions which protects against tampering. The way this works is that the terminal will ask the card to generate an application cryptogram. It'll pass it certain information that's on the transaction, such as the transaction amount, for example. The card is going to take that information and encrypt it using another key that's stored on its secure element. The terminal then packages all of that up and sends it through to us. We receive the transaction details, and we also receive this encrypted block that we call the cryptogram, which contains the same details. We can send that to the HSM and it will compare the values. We can decline if any of those values don't match, which indicates that the transaction has been tampered with. There's also a transaction counter in this block that we can use to prevent replay attacks. Those are some of the main cryptographic operations that we perform as a card processor. Let's head back to the architecture.

Zero Trust VPC Architecture

Creanga: Now that you’ve delved in the fascinating world of HSMs, we’re going to go back to the zero trust VPC and what we have in there. We’re going to talk about what’s happening in each one of the microservices that live in here. The first two edge services that we have, they are MasterCard service and HSM service. MasterCard service, its main role is to accept messages from the MIPs and be able to decode them. Card payments globally have to comply to a specific format called ISO 8583. All of these card payments are coming into this format, and card associations have different implementations of this. Visa will have its own implementation, and MasterCard will have a different implementation. For MasterCard specification, what we had to do here in this MasterCard service is to decode all of these messages according to that manual. It was really hard work because this manual has over 1000 pages that we had to read, understand, and then implement. A lot of hard work went into this MasterCard service.

How does the MasterCard service even get the data from the MIP? What we have here is a TCP/IP connection; it's just streams of bytes flowing through. This is not going over the internet: there are VPN tunnels over a private network, and we use AWS Direct Connect, so we don't have to send the data over the internet. These are all encrypted streams of bytes going into the MasterCard service, which will then decode all these card authorization messages according to the manual that I've just mentioned, the one that's over 1000 pages long. The decoding will just transform all of these bytes into a very simple Java object that represents a card transaction and gets passed around inside this VPC. That's the MasterCard service. The HSM service just acts as a proxy client to send commands to the HSMs, and now you know all about these commands that we're sending to them: the PIN verification and the cryptogram validation. It's the same principle as the MasterCard service; there's also a TCP/IP connection between them. We use the Netty I/O framework to get help with managing all of these network messages that are coming from the MIPs or from the HSMs.
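
A minimal sketch of that kind of Netty pipeline, assuming (purely for illustration) that each message on the MIP link is framed with a two-byte length prefix; the host name and port are placeholders, and the actual ISO 8583 field decoding is omitted:

import io.netty.bootstrap.Bootstrap;
import io.netty.buffer.ByteBuf;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;
import io.netty.handler.codec.LengthFieldBasedFrameDecoder;

public class MipConnection {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup group = new NioEventLoopGroup();
        try {
            Bootstrap bootstrap = new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class)
                .handler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        ch.pipeline()
                          // Re-assemble complete messages from the TCP byte stream
                          // using an assumed 2-byte length header, then strip the header.
                          .addLast(new LengthFieldBasedFrameDecoder(64 * 1024, 0, 2, 0, 2))
                          .addLast(new SimpleChannelInboundHandler<ByteBuf>() {
                              @Override
                              protected void channelRead0(ChannelHandlerContext ctx, ByteBuf frame) {
                                  // Decode the ISO 8583 fields here and hand the resulting
                                  // object to the rest of the service (omitted).
                              }
                          });
                    }
                });
            // Placeholder endpoint; in practice this would be the private MIP link.
            bootstrap.connect("mip.example.internal", 9999).sync().channel().closeFuture().sync();
        } finally {
            group.shutdownGracefully();
        }
    }
}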

Another thing about the HSM service: to be able to send all these commands to the HSMs, it needs specific data like the encrypted PIN block, the track data, and so on. It doesn't have a database, so it has to go to some other services to get the secrets from. You'll see that this is done only if the HSM service holds access tickets that it got from other services. Only if those tickets are valid will it be able to get this data from the safe services and perform its commands. That's all about the edge services that we have. Now let's look at what happens next, after we've decoded a card transaction message coming from the MIP. What happens next to the card transaction?

Donovan: The next step in the process is to gather the results of the cryptographic validation. Once the MasterCard service has decoded the message, the first thing that it does is go to the tokenization service to tokenize the customer account number into a unique ID. It does this using a hash function. The reason is just so that we are passing something other than the account number between the other services, as an extra layer of security. The MasterCard service takes that information, plus what it got from the MasterCard network, and sends it to this validator service. This validator service is going to create the access tickets that Ioana mentioned, to send on to the HSM service to be able to validate the cryptographic data on the transaction. The validator is going to get those results back, package them up, and send them on to the next step to execute the business rules. It very deliberately does not make any decisions on what to do with the transaction at this stage. It's just a dumb data and validation results collector. The next thing that needs to happen is that the HSM service is going to need to get the data it requires to perform that crypto validation so that it can send the results back.
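
A minimal sketch of that kind of tokenization; the use of plain SHA-256 here is an illustrative assumption, and a production system would more likely use a keyed or salted construction:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class PanTokenizer {

    // Derives a stable "zero trust UID" from the PAN so downstream services
    // never see the real account number.
    static String zeroTrustUid(String pan) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(pan.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}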

Creanga: These are the microservices that hold the card secrets, like the PAN, or the PIN block, or the track data in the track safe. The track safe holds just the CVV and expiry date. These three microservices all store these secrets, and they just hold simple pairs of a zero trust UID and whatever secret they need to store. A zero trust UID is simply the hash of the PAN. For example, the PAN safe will have the zero trust UID and the PAN; the PIN safe will have the zero trust UID and the encrypted PIN block. The way this works, besides storing these secrets, these microservices also need to be able to return them to the HSM service. They can only do that if the HSM service passes a valid ticket. Before returning any data to the HSM service, they will all go to the tokenization service, which actually created these tickets, to validate them. Only if the tickets are valid will they return the data to the HSM service, which can then send its commands to the HSMs. Let's look in a little bit more detail at how this access ticketing system works.

Safe Access Tickets

Donovan: This ticketing system really works hand in hand with what the zero trust model provides. Each of these services has its own subnet. We have to define ingress and egress rules on both sides of the conversation in order for them to be able to talk to each other. The validator is going to need to create these access tickets for the HSM proxy to be able to get the data out. It creates those tickets and sends them on, but the HSM proxy itself does not have these ingress and egress rules, so it cannot reach the tokenization service at all. The tokenization service has a number of different servlets that it exposes on different ports. One of those is for ticket creation and a separate one is for validation. We allow the crypto validator to reach just the creation port, and the HSM proxy is blocked from reaching either. If Ioana here, for example, was upset with me about something, maybe because I'm spending too much time playing guitar and not enough time working, and she tried to steal my account details: if she were to somehow manage to get onto the HSM proxy box, she could actually call some of the endpoints on the safes, but she wouldn't be able to create a ticket to get access to that information. If she were to breach one of the safe boxes, again, she could call those endpoints, but the safe boxes can only validate tickets, they can't create them. They are blocked; there are no rules for them to be able to do so. What will happen is the validator can create the ticket and send it on to the HSM proxy, and all that can do is pass it on to the next step to get the data out. The safes will call tokenization to validate that ticket and return the data back if it is valid. These tickets also have a very limited lifespan; they will only be validated successfully within a very short time after their creation.
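Here is a minimal sketch, not the team's actual code, of how such short-lived tickets could behave: only the tokenization service issues them (on its creation port) and validates them (on its validation port), and validation fails once the short lifespan has passed. The names, the in-memory store, and the single-use behavior are all assumptions for illustration.

```go
package tickets

import (
	"crypto/rand"
	"encoding/hex"
	"sync"
	"time"
)

// TicketStore lives inside the tokenization service; the creation and
// validation methods would be exposed on its two separate ports.
type TicketStore struct {
	mu     sync.Mutex
	issued map[string]time.Time
	ttl    time.Duration
}

func NewTicketStore(ttl time.Duration) *TicketStore {
	return &TicketStore{issued: make(map[string]time.Time), ttl: ttl}
}

// Create issues a new random ticket; only the crypto validator can reach this.
func (s *TicketStore) Create() string {
	buf := make([]byte, 16)
	_, _ = rand.Read(buf)
	id := hex.EncodeToString(buf)
	s.mu.Lock()
	s.issued[id] = time.Now()
	s.mu.Unlock()
	return id
}

// Validate is what the safes call before releasing a secret; it rejects
// unknown or expired tickets, and consumes the ticket on success.
func (s *TicketStore) Validate(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	created, ok := s.issued[id]
	if !ok || time.Since(created) > s.ttl {
		return false
	}
	delete(s.issued, id) // single use: an assumption for this sketch
	return true
}
```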

Architecture

Creanga: We’re now done with the zero trust VPC. We said that the main role of this VPC is to perform cryptographic operations and gather more information about this card transaction that is in flight. It will never make any decisions about if this card transaction will be approved or not. Crypto validator here, gather all the data and will pass it on to the core VPC where you see their card processor to make that decision. This is where we apply all of those business rules like, does Rob really have enough money in his account for this new guitar? Also, has he blocked maybe transaction to Denmark Street? All of these maybe other fraud checks that we have to perform, they’re performed here in the card processor. Also, a tiny implementation detail here, what we’ve built in-house is just the business rules engine. We didn’t use anything that was already there. It’s just a very lightweight business rules engine that we built here to help us with that. I think that’s about it. That’s all you need to know about the architecture.

Technical Highlights

Donovan: It goes back to the challenges that we mentioned earlier. As we said, this bridging between a data center and a nice modern cloud stack was achieved using AWS Direct Connect, in tandem with a VPN tunnel, so that we can have this nice, secure private network to communicate between the two, but continue releasing stuff very rapidly in our cloud environment. Then, using the security features in AWS to achieve our zero trust model has been really interesting, especially in combination with this ticketing system that we’ve developed. We think that’s a really nice robust way of protecting our data assets. Finally, this separation of the data gathering aspect of things from the business logic has worked really well. This means that when we do have issues, or we have customers asking us why a particular transaction was declined, for example, we know that there’s just one place we need to go and look. That’s on the business logic side on the core VPC. We can see the steps that it took through our workflow engine. The zero trust side of things just sits there and does its crypto stuff and just works. We don’t really have to touch that at all, which is great. We don’t have to look in loads of different places when you’re trying to debug these things.

Questions and Answers

Ignatowicz: I’d like to explore more because you mentioned a lot of microservices that you run in production, which tools or techniques do you use to do observability of those microservices running in production?

Creanga: I can tell you about the stack that we're using: we use Prometheus for our metrics, and then we use Grafana for dashboards. Prometheus metrics are what we use for alerting, and so on. Obviously, we use some AWS features as well to alert us if any of the metrics that we have in there indicate something going wrong.

Donovan: We use correlation tracking quite a lot as well, which is really helpful, obviously, when you're following the logs. We also use Instana for tracing. We've found that really helpful with this architecture for finding where bottlenecks are, essentially. You can track all of those service calls through all the different layers and see where the time is being spent.

Ignatowicz: You talked about a period during the migration when you were running your solution and also the third party vendor solution. I assume that from the terminal point of view, the payment terminal, there would be different latency for the two services. Do you have any strategy to route part of the traffic to one or the other, depending especially on the latency, which you cannot control in a third party provider?

Creanga: The routing is done by MasterCard itself. When you set up a card processor with MasterCard, you get that. Initially, when we had the third party provider, that was one setup that we had with MasterCard, and the routing is based on the card number. The card number and the BIN range tell you exactly where a card's traffic will go. There's nothing that we can do on the fly to route it to the new card processor. For example, if you wanted to move everything, all the traffic, to the new card processor, what you have to do is reissue all of the cards into the BIN that you have set up with MasterCard, so that the traffic will be routed by MasterCard itself to the new card processor. That comes at a high cost to the company, to the bank.

See more presentations with transcripts



Presentation: How SeatGeek Successfully Handle High Demand Ticket On-Sales

MMS Founder
MMS Anderson Parra Vitor Pellegrino

Article originally posted on InfoQ. Visit InfoQ

Transcript

Pellegrino: We’re going to talk about how we at SeatGeek, we’re able to handle high demand and ticket on-sales, and we’re able to do that in a successful way, preserving the customer experience throughout the process.

Parra: I’m Anderson. I work in the app platform team at SeatGeek as a senior software engineer. I’m the tech lead for the virtual waiting room solution that was made at SeatGeek.

Pellegrino: My name is Vitor Pellegrino. I’m a Director of Engineering, and I run the Cloud Platform teams, which are the teams that support all the infrastructure, and all the tooling that other developers use at the company.

What’s SeatGeek?

Let me talk a little bit about what SeatGeek is. SeatGeek works in the ticketing space. If you buy tickets, sell tickets, this is pretty much the area we operate in. We try to do that by really focusing on a better ticketing experience. It's important for us to think about the experience of the customer, and of whoever is selling tickets as well. This is our main differentiator. Maybe you will recognize some of these logos here; we've been very fortunate to have partnerships with some of the leading teams in the world, not only for the typical American sports. If you're a soccer fan and you know the English Premier League, clubs like Liverpool or Manchester City also work with us. If you buy tickets for them, you use SeatGeek software.

The Ticketing Problem

Let’s talk about what actually the ticketing problem is, and why we felt like this could be something interesting. When we think about ticketing, if you’re like me, most of your experience is actually trying to find maybe a concert to attend, or maybe you would like to actually buy tickets to your favorite sports team, or maybe a concert, or what have you. You’re actually trying to see what’s available, you’re going to try to buy a ticket. Then after you’ve bought the tickets, you’re eagerly waiting for the day to come for the match or whatever event that you have, and then you’re going to enter a stadium. This is what we would call the consumer aspect, or the consumer persona of our customer base. Maybe I own a venue, actually, I represent one of the big sports clubs, or I actually have a venue where we do big concerts. In that case, my interests are a little bit different. I want to sell as much inventory as quickly as possible. Not only that I want to sell tickets, but I also would like to know how my venue is. Which ones were the most successful? Which ones should I be trying to repeat? Which one sold out as quickly as possible? Also, I would like to manage even how people get inside the stadium. After I sold tickets, how do I actually allow people to get inside a venue safely at a specific time, without any problems?

Also, a different thing that we handle at SeatGeek is that the software we build runs on different interfaces as well, which means that we have the folks that are buying tickets on mobile. We have the scanners that allow people to get inside a venue. Also, when they get inside the venue itself, they might have to interact with systems with different types of interfaces, different reliability and resilience characteristics. This is just one example of one of these physical things that run inside a stadium: at the height of the pandemic, we were able to design solutions for folks to buy merchandise without leaving their seats. That's just an example of software that needs to adapt to these different conditions. It's not only a webpage where people go, there are a lot of things to it.

About the characteristics of ticketing itself: in a stadium, there is a limited amount of space. They may be very large stadiums, but they are all at capacity at some point, which means that very often, especially for the big artists and the big events, you have far more demand than you have inventory for. Then what happens if many people are trying to get access to the same seats, how do you actually tiebreak that? There are concurrency issues, like maybe two people are trying to reserve at the same time, how do you tiebreak? Maybe you need to also think about the experience: is it whoever saw the seat first, or whoever was able to actually go through the payment processor first? Who actually gets access to the seat? This is something very characteristic of this problem.

Normal Operations vs. On-Sales

Most of the time we are in a situation that we call normal operations. People are just browsing the website trying to find things to attend. Maybe, coming back to the example that I brought, I just want to see what's available. To me, it's very important to quickly get access to what I want, and have that seamless browsing experience. Then there is what we define as an on-sale. This is actually something that has a significant marketing push behind it; there is usually also something happening in the outside world that impacts the systems that we run. Imagine that I'm Liverpool, and I'm about to release all the season tickets at a specific time on a specific day. People are going to SeatGeek at that specific time trying to get access to a limited amount of inventory. These are very different modes that our systems need to understand. Let me actually show you how that looks. The baseline is relatively stable. During an on-sale, you have far more traffic, sometimes many orders of magnitude more. If you're like me, before I joined this industry, I thought, what's the big deal? We have autoscaling now. We tweak the autoscaling, and that should just do it. It turns out, that's not enough. Most of the time autoscaling simply cannot scale fast enough. By the time you actually start to see all this extra traffic, the autoscaling takes too long to recognize it and then add more capacity. Then we just have a bad customer experience for everybody. That's not what we want to do.

Another example of things that we have to think about during an on-sale: the tradeoffs, like the non-functional requirements, might also change during an on-sale. Latency is one example. Maybe I would say that in normal operations, I want my website to feel really snappy. I want to quickly get access to whatever is happening. During an on-sale, I might actually trade latency for more redundancy. Maybe during an on-sale I'm even going to have different code paths to guarantee that no request will fail. Getting a 500 response is never a good experience, but it's far more tolerable if it happens outside of a very stressful situation where I have waited in line perhaps hours to get access to my tickets. This is something very important.

Even security. Security is always at the forefront of everything that we do, but detecting fraud might change during an on-sale. One example. Let's say that I'm trying to buy one ticket, but maybe I'm trying to buy tickets not just for me, but for all my friends. I actually want to buy 10 tickets, or maybe 20. I have a huge group. How much is tolerable? Is one ticket ok? Are 10 tickets ok? Are 20 tickets ok? We also see that people's behaviors change during an on-sale. If I actually open a lot of different browsers and a lot of different tabs, and I somehow believe that that's going to give me a better chance of getting my tickets during an on-sale, is that fraudulent or not? There is no easy answer to these kinds of things. This is something that may change depending on whether you are doing an on-sale or not. The main point here being that you must design for each mode of operation very differently. The decisions that you take for one might be different from the decisions you take for the other one.

Virtual Waiting Room

Parra: We’re going to talk about the virtual waiting room, that’s a solution inside of SeatGeek called Room. It was made in SeatGeek. It’s a queuing system. Let’s see the details of that. Before we talk about virtual waiting room, let’s create a context about the problem where the virtual waiting room acts. Imagine that we’d like to purchase a ticket for an event with high demand traffic, a lot of people trying to buy tickets for this event as well. Then we have the on-sale that starts at 11:00, but you arrive a little bit earlier. Then we have a mode that we’re protecting the ticketing page, then you are in the waiting room, it means that you are waiting for the on-sale to start, because we’re a little bit earlier. Then at 11:00, imagine that on-sale starts, and then you are settled in the queue and you are going to wait for your turn to go to the ticketing page and try to purchase the tickets. We know that queues are bad. Queues are bad in the real world and queues are bad in the virtual world as well. What we try to do, we try to drain this queue as fast as possible.

Why Do We Need A Queuing System?

Why do we need a queuing system? We talked about how a scaling policy is not enough. Also, there are some characteristics of our business that require a queuing system. For example, we need to guarantee a fair approach to purchasing tickets. In queuing for online ticketing, the idea is that you have first-in, first-out: whoever arrived earlier gets the chance to try to purchase tickets earlier. Also, to control the operation, you cannot send everyone who is waiting to the ticketing page at once, because users operate by reserving and then finishing the purchase of the tickets. In finishing the purchase, a lot of problems can happen. For example, a credit card could be denied, or you realize that the ticket is so expensive that you give up, and then the ticket becomes available again for another user, who then has the opportunity to try to purchase it. It means that we need to send the people waiting to the ticketing page in batches. Also, we need to avoid problems during the on-sale. We talked about the characteristics of the different modes we operate in, normal mode and on-sale mode. The on-sale mode is our critical time: everybody is looking at us and we're trying to keep our systems up and running during this time. By controlling the traffic, we have the opportunity to avoid problems during the high demand traffic.

The Virtual Waiting Room Mission

Then, what’s the mission of the virtual waiting room? The mission of the virtual waiting room is absorbs this high traffic and pipes it to our infrastructure in a constant way. The good part, this constant traffic we know we can execute for example loading tests, then you can analyze how many requests per second our system supports. Then based on this information, we can try to absorb that spike of the high traffic, and that pipes it in a constant way to our infrastructure in order to keep our systems up and running.

Considerations When Building a Queuing System

Some considerations when you're building a queuing system. Stateless: in an ideal world, the best approach is to control the traffic at the edge, in the CDN, and avoid requests going to the backend when you don't need to handle those requests at that time. However, there is no state in the CDN, and with no state you cannot guarantee the order. Order matters for us. We need to track who arrived earlier, so we can drain those people from the queue earlier. Then, when you have the stateful situation, the state is traditionally controlled in the backend, so we need to manage the state of the queue in the backend. If you have a queue, then you need to talk about the possibilities for draining this queue. It could be, for example, random selection: we have a queue, you select a few users from that queue, remove them, and send them to the ticketing page, for example.

Again, first-in, first-out is the fair approach that we can use in our business, so we chose first-in, first-out. Also, we need to provide some operations, some actions that operators can take during their on-sales. They can, for example, increase the exit rate of the queue. They can pause the queue because there is a problem in one component of our system. They need to communicate with the audience that is there: for example, if it is sold out, you want to broadcast that information as fast as possible, because then people don't need to wait for something that they cannot get. Another important thing: metrics. We need to get metrics for all the components to analyze what's going on. We need to identify the behavior of our system in terms of how long the queue is, or how much time a user spent purchasing the ticket. And how did our components behave during the on-sales? This is important for making decisions to improve our systems.

Stateless or Stateful?

For us, the main decision was around stateless or stateful. With stateless, with simple JavaScript code we can, for example, create a rate limit where, say, 30% of the traffic is let through and 70% waits on the queue page. It's simple, but in the end it's only a rate limit, there is no queue guarantee. Because of the stateless work that we can do in the CDN, we implemented a hybrid mode, where we have part of our logic running in the CDN and part of our logic running in the backend. We took both pills, the red and the blue pill: the state is managed in the backend, and everything else runs in the CDN. However, when I said that the CDN is stateless, it's not completely true, because the modern CDN stacks provide a simple datastore. In this case, we use Fastly, and Fastly provides an edge dictionary. We are using this edge dictionary as our primary cache. Then you have a problem, because we have DynamoDB in the backend that controls the state of the queue and the information about what we're protecting. Our primary datastore is DynamoDB, and we have the datastore in Fastly's edge dictionary as our primary cache. We need to sync both of these datastores, and that becomes a problem.

Virtual Waiting Room Tech Stack

Let’s zoom out and see our tech stack. Each column means a main component in our solution. Then we have JavaScript running in the browser to open the WebSockets and control the state of the queue. We have part of our logic running in the CDN. We rely on Fastly, our backend in Go. We have Lambda functions, API gateway. Data storage is DynamoDB. For observability, we rely on Datadog. How is it connected? The traffic from mobile and from browsers goes through Fastly. Fastly is the main gateway. It’s the main entrance for all the traffic, and everything is controlled in Fastly. Fastly talks to API gateway that talks to the Lambda functions that talks to DynamoDB. In the API gateway, we have WebSockets, we have HTTP gateway. We also have some jobs running in the Lambda, then everything provides the entire solution.

How Does It Work?

We’re going to see in detail how it works. The virtual waiting room operates in two modes. The virtual waiting room, basically where the on-sale didn’t start yet, and then it blocks all the traffic that goes to the protected zone, and the queuing mode where we have a queue. Then we are draining this queue, means the on-sale starts, and both modes are protected zone. What is a protected zone in the end? It’s a simple path. Usually, it’s a ticketing page, where everybody goes there to try to purchase the tickets. Then we have some details, for each protected zone, we have a queue format, for each protected zone in production, we have over 2000 protected zones running now. Then you have some attributes for that protected zone, like the state. It could be blockade, throttle, done, or we’re creating or designing a new protected zone. We have the path, the resource that we’re protecting. It could be 1 or 10, depends on how the event was created, you have many dates or not. The details of that event. The limits as well for the exit rate. The idea that you can get in to the protected zone, a user will be redirected to the protected zone if they have an access token. With the access token, requests are routed to the protected zone.

The Virtual Waiting Room Main States

Then let’s see how we can get an access token to go to the protected zone. First of all, blockade. In the blockade, there is no access token. Also, there is no communication with the backend. Everything is resolved in Fastly, in the CDN. It means that the request for a specific ticketing page is fired by the browser, then Fastly identifies that that path is protected, and the protected zone is in the blockade. Then route back the user to the waiting room page, no communication with the backend, everything is stored in our primary cache in the edge dictionary. Then Fastly validates the state of that protected zone to route the user appropriately. Then, on-sale starts and then we transition the protected zone from blockade to throttle, and a queue is formed. When the queue is formed, it’s formed because the on-sale starts, then the browser fires the same request to go to the ticketing page. Then we identify that this particular path is protected by that protected zone. That protected zone now is not in blockade anymore, it’s in the throttle. Then we create a visitor token, and then we route back the user to the queue page with that visitor token. When you are in the queue page with that visitor token, we have a JavaScript that opens a WebSocket to the API gateway, send that visitor token through the WebSocket, and then get registered in the queue. In that moment, we associate the timestamp with the visitor token. Then that’s the way that you guarantee the order in the queue. We can sort by the timestamp, then we can see the order of the queue, how we’re going to drain this queue.

Then we have an exchanger function that runs periodically, draining that queue; it's basically exchanging visitor tokens for access tokens. We fetch all the visitor tokens that were registered and that don't have an access token yet, and exchange them. When we're updating the database, we're updating DynamoDB, and this data gets streamed: we're streaming the change from Dynamo to a Dynamo Stream. Then we have a function that consumes that notification from the Dynamo Stream and notifies the user. We have the WebSocket open and we're taking advantage of that: we send back the access token. It means that the user didn't ask for the access token, it's a reactive system. When we identify that you are ready to go to the protected zone, because your access token was created, we notify the user. Then the visitor token is replaced; there is no visitor token anymore. The user has only an access token, and with that access token, you can get into the protected zone. The page is refreshed, the user sends that access token, and with that access token, without any call to the backend, the access token is validated. The security is important: we try to identify that real users are trying to purchase the tickets, and then the user can go to the protected zone.
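A minimal sketch of that reactive notification step: a Lambda consumes the DynamoDB stream and, when the exchanger has written an access token, pushes it back over the user's open WebSocket via the API Gateway Management API. The endpoint, attribute names, and message shape are assumptions for illustration.

```go
package main

import (
	"context"
	"encoding/json"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/apigatewaymanagementapi"
)

// The management endpoint of the WebSocket API (assumed value).
var api = apigatewaymanagementapi.New(session.Must(session.NewSession()),
	aws.NewConfig().WithEndpoint("https://example.execute-api.us-east-1.amazonaws.com/prod"))

func handler(ctx context.Context, e events.DynamoDBEvent) error {
	for _, rec := range e.Records {
		img := rec.Change.NewImage
		token, hasToken := img["accessToken"]  // written by the exchanger (assumed attribute name)
		connID, hasConn := img["connectionId"] // stored when the visitor registered
		if !hasToken || !hasConn {
			continue
		}
		payload, _ := json.Marshal(map[string]string{"accessToken": token.String()})
		// Push the token to the browser waiting on the open WebSocket.
		_, err := api.PostToConnectionWithContext(ctx, &apigatewaymanagementapi.PostToConnectionInput{
			ConnectionId: aws.String(connID.String()),
			Data:         payload,
		})
		if err != nil {
			return err // let Lambda retry the stream batch
		}
	}
	return nil
}

func main() { lambda.Start(handler) }
```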

Behind the Scenes – Leaky Bucket Implementation

Let’s see behind the scenes, let’s see how those components were made. In general, we are using a leaky bucket implementation. If we’re navigating in the seatgeek.com, we’re not seeing queue all the time, but we’re protecting mostly events in the SeatGeek. The protection is based on the leaky bucket implementation, where we have a bucket for each path, for each protected zone, it’s a bucket with different exit rates. Then you can see that when request comes, if that bucket that’s protecting that zone is full or not? If it’s full, the request is routed to the queue page. If it’s empty, in real time, we’re generating the access token. If that access token is from the same mechanism that I showed before, that with the access token you can get into the protected zone. Then real time, we’re creating an access token, then we’re associating to the request, and the user can be routed to the target. That’s an example of how it works.

Then we have a simple Lambda function that was written in Go. The way that we're communicating and caching information with Fastly, that's the important thing: we can control the cache. For example, if you identified that the bucket is full for one request, it means that for the subsequent requests the bucket is still full. It works like a circuit breaker. When we identify that the bucket is full, we return 429 to Fastly as the HTTP code, 429 Too Many Requests, and we control how long it will be cached. Then in Fastly, the subsequent requests to the same protected zone don't need to call the backend again to validate whether the bucket is full or not, because the previous request told us that it's full. We cache for a certain period, and when it expires, we try again. If it's still full, it returns 429 again, and we cache it again. That way we're reducing the amount of traffic that goes to the backend: when you don't need to route requests to the backend, you don't route them, you can figure out that path in the CDN. That's the way it works. In Fastly, it's simple VCL code. With that simple VCL code, we just identify the status of the response and cache it according to the status. We're not here to advocate in favor of VCL. Fastly supports different languages. In a modern stack, you can use, for example, Rust, or you can use JavaScript, and then you don't need to use VCL anymore.
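A minimal sketch of that circuit-breaker response from the Go Lambda. The handler shape, route parameter, header choice (Surrogate-Control is one header Fastly honors for edge caching), and the 5-second TTL are assumptions, not SeatGeek's code:

```go
package main

import (
	"context"
	"strconv"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

const bucketFullCacheSeconds = 5 // assumed TTL for the cached 429

// bucketIsFull is a stand-in for the leaky bucket check sketched earlier.
func bucketIsFull(zone string) bool { return false }

func handler(ctx context.Context, req events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
	zone := req.PathParameters["zone"] // assumed route parameter
	if bucketIsFull(zone) {
		return events.APIGatewayProxyResponse{
			StatusCode: 429, // Too Many Requests: acts as the cached "circuit breaker" answer
			Headers: map[string]string{
				// Surrogate-Control tells Fastly how long it may cache this response at the edge.
				"Surrogate-Control": "max-age=" + strconv.Itoa(bucketFullCacheSeconds),
			},
		}, nil
	}
	// Bucket has room: issue an access token in real time (omitted here).
	return events.APIGatewayProxyResponse{StatusCode: 200}, nil
}

func main() { lambda.Start(handler) }
```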

Why Are We Using AWS Lambda?

Then, why are we using Lambda? Why do we have that infrastructure in Lambda? In general, at SeatGeek, for the product, we are not using Lambda; we have another infrastructure running on Nomad, which also runs on AWS. We have a completely different stack that runs the virtual waiting room in Lambda. Why Lambda? Because we're trying to avoid cascade effects. For example, if we're running our virtual waiting room together with the products that we're trying to protect, and that environment catches fire, then we have a cascade effect on the solution that is protecting that environment. That doesn't make sense. So we run it apart from our product environment. AWS Lambda provides a simple way to launch that environment from scratch, and also supports a nice way to scale that environment based on the traffic, on demand as well.

Why Are We Using DynamoDB?

Why DynamoDB? Why are we relying on DynamoDB? First of all, Dynamo Streams: it's easy to stream data from Dynamo. With a few clicks in the console, you can stream the change that was made to a single row into a stream that you can consume later. Comparing with MySQL, for example: to do that with MySQL, we could use a Kafka connector that reads the MySQL logs and fires them to a topic in Kafka. But then you can have coordination overhead with the team that supports MySQL. It's possible, but in the end we chose Dynamo because it's simple to stream the data. Also, because Dynamo provides a nice garbage collector. I think everybody is familiar with the garbage collector in Java, the idea that you can collect the data that you don't need anymore. In our case, the data of the queue is important during the on-sale. After the on-sale, you don't need that data anymore. In normal operation, the size of our database is zero because there is no queue formed.
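The "garbage collector" here is presumably DynamoDB's time-to-live feature, though the talk does not name it. Assuming that, a minimal sketch of the one-time table configuration (table and attribute names invented): each queue item would then carry an epoch timestamp in that attribute and be deleted automatically after the on-sale.

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
	db := dynamodb.New(session.Must(session.NewSession()))
	// One-time configuration: tell DynamoDB which attribute holds the expiry
	// epoch timestamp, so expired queue items are removed automatically.
	_, err := db.UpdateTimeToLive(&dynamodb.UpdateTimeToLiveInput{
		TableName: aws.String("waiting-room-queue"), // assumed table name
		TimeToLiveSpecification: &dynamodb.TimeToLiveSpecification{
			AttributeName: aws.String("expiresAt"), // assumed attribute name
			Enabled:       aws.Bool(true),
		},
	})
	if err != nil {
		panic(err)
	}
}
```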

DynamoDB Partition – Sharding

There are some tricks regarding DynamoDB. It's not so easy to scale. There are some limitations regarding partitions, so designing your partition matters. For example, the default limit is 3,000 read requests per second, or 1,000 write requests per second, to a specific partition. Above that, Dynamo starts to throttle. When we get throttled, we need mechanisms to handle those throttles, because otherwise the user is going to receive a failure. You could, for example, retry, but if you retry, you can make the problem worse. We are basically using sharding. Sharding is a way to scale past the per-partition limits that Dynamo supports. For example, with a simple piece of code in Go, we create 10 different shards: we design our hash key, which is our partition key, and then append the shard to the partition key. Then we can increase the amount of traffic that Dynamo supports. For example, if we're using 10 shards, you can multiply the supported throughput by 10: instead of just 3,000 read requests per second, you're going to support 30,000 read requests per second, and you go from 1,000 write requests per second to 10,000 write requests per second. It's not for free. If you need, for example, to run a query, it means that you need to deal with 10 different partitions, 10 different shards. A full table scan means 10 shards that you need to go through to try to fetch the data. You can run that in parallel, and you can try different combinations to avoid throttling in DynamoDB.
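A minimal sketch of that write-sharding idea, along the lines of the Go snippet mentioned in the talk (the shard count of 10 comes from the description; the key separator and function names are illustrative):

```go
package sharding

import (
	"fmt"
	"math/rand"
)

const shardCount = 10

// ShardedPartitionKey appends a random shard suffix to the base partition key,
// spreading writes for one hot key across shardCount physical partitions.
func ShardedPartitionKey(baseKey string) string {
	return fmt.Sprintf("%s#%d", baseKey, rand.Intn(shardCount))
}

// AllShardKeys returns every sharded key for a base key; reading everything
// back means querying each of them, possibly in parallel.
func AllShardKeys(baseKey string) []string {
	keys := make([]string, 0, shardCount)
	for i := 0; i < shardCount; i++ {
		keys = append(keys, fmt.Sprintf("%s#%d", baseKey, i))
	}
	return keys
}
```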

Sync Challenge: DynamoDB – Fastly Dictionary

Then, finally, let’s talk about the sync challenge. We have the edge dictionary in Fastly, and we have DynamoDB as our primary datastore. We have the advantage of the Dynamo Stream. That problem to write to a table in the database and talk to another system, it’s a common problem, then it was addressed by the transactional outbox pattern. For example, if you’d like to write to users table in MySQL, and then fires a message to the random queue, there is no transaction bound on that operation. Then you need to find a way how you can try to guarantee the consistency between those operations. One way is the transactional outbox pattern, basically, we don’t call the two datastores at the same time. When the request to change the protected zone, for example, arrives, it talks to DynamoDB, then get updated in a single operation. Then this change is streamed, and we have a Lambda function that consumes that change in the Dynamo table, and then talks to the Fastly edge dictionary. Then if the communication with the Fastly dictionary fails, we can retry. You can retry until it succeeds. Then this way, we are introducing a little bit the delay to propagate the message, but we guarantee the consistency, because there is no distributed transaction anymore, there is no two components, two legs in our operation. It’s just a single operation, then we’re trying to apply that in sequence.

Let’s see an example of the edge dictionary. The edge dictionary is created as a simple key-value store in Fastly, then there’s a simple table that is in memory. Fastly offers an API that you can interact with that dictionary, that you can add items and remove items. This is an example of the code in VCL. With the same code in Rust, you can achieve the same, then you can take advantage of the edge dictionary, and also you can take advantage of the modern stack that you can run in the CDN using that kind of code. If you’d like to know more details about it, we’ve made a blog post together with AWS regarding how it works and how we are using AWS to help us to run our virtual waiting room solution. We have the link here.

Virtual Waiting Room Observability

I’d like to talk about the observability that’s the important part of our system, how we’re driving our decisions based on metrics, based on the information that we are collecting. We have different kinds of behaviors. We are relying on Datadog to store our metrics, but also we are using AWS Timestream database that provides us long term storage. Ideally, we can monitor and observe everything in terms of if you can see problems in the latency of Lambda functions, how Fastly is performing, how many protected zones do we have in our system? If there are errors, what’s the length of the queues? How long are users getting notified? All the dashboards provide the operators and the engineers a vision of what’s going on during the on-sale. We are also trying to take advantage of that to provide sensors. Sometimes we have traffic that’s not expected, then you can notify the users for example through their Slack, then they are going to be aware that it was unpredictable in terms of the traffic that’s going on.

Next Steps for Our Pipeline

Pellegrino: Ander talked about the current state, the solution that has powered 100% of all of our on-sales and all of our operations for a little bit over the past year. Let me talk to you a little bit about things that we're looking to do in the future. We don't claim that we have all the solutions yet. I would like to offer you a little bit of insight about how we're actually thinking about some of these problems. The first thing is automation. Automation is a key important thing for us, because as we grow, and as we have more on-sales coming, we start to see bottlenecks. Not a bottleneck exactly, but having humans as part of that process is not scalable for us. Our vision is to have on-sales done only by robots, meaning that a promoter can design their on-sale timeline. They can say, I'm going to have a marketing push happening at this time, and this is when the tickets can be bought, and all the rest is able to be done automatically. Ander talked about the exit rate, so we could adapt the exit rate based on observing the traffic. We could also have different ways of alerting people based on whether something is happening within a specific critical moment of an on-sale, which could have a different severity. Also, fraud detection is always an important thing for us. We want to get even more sophisticated about how we can detect when something is legitimate behavior versus when something is actually an attempt at abuse.

Next Steps for Our Operations

I think that’s a very key point here, like our systems, they must understand in which mode they are operating under. That means each one of the services that we have, each one of the microservices, they should be able to know, am I running in an on-sale mode? That can inform our incident response process. Let’s say I have an issue that’s happening, people cannot get access to a specific event. If that’s happening outside of normal hours, it has a different incident priority, then if that actually happens during an on-sale, so the telemetry should also know that. I would actually like to be able to have each one of our dashboards reporting, what is my p99 for a specific endpoint, but actually, what’s my p99 only during normal operations, or only during on-sales.

Another thing that is critical for us, and we're making some important movements in that direction, is around service configuration. We do use several vendors for some of the critical paths. We would like to be able to use them and change them dynamically: say maybe I have not only one payment processor, but in order to guarantee that my payments are coming through, I can use several during an on-sale. Maybe outside of an on-sale that is not as important. SLOs: for us, whenever we define our SLOs, we need to understand our error budget, but we also need to be able to classify, what's my error budget during an on-sale? We already do quite a bit of that. We would like to do that even further.

Summary and Key Takeaways

We talked about how it’s important to think about elasticity in all layers of infrastructure. Queuing are useful. People don’t like being in a queue, but they’re vital components. That doesn’t reduce the importance of actually designing elasticity in all the different layers. If you have a queuing system, maybe you should also think about how you’re going to scale your web layer, your backend layer, even the database layer as well. This was critical for us, like really understand the toolkit that you have at your disposal. The whole solution works for us so well, because we’re able to really tap into the best of the tooling that we had at our disposal. We worked with AWS closely on that one. I think we could also have done this solution differently using other different toolkits, but it would look very different. I would highly encourage you to really understand the intrinsic keys, and all the specific things about the system that you’re leveraging.

This is something that we started using, and it was a pleasant surprise. It's a topic that we're seeing more of. Maybe you have a certain type of use case that fits moving some of that storage over to the edge. It's a relatively recent topic. It works for us, and I would encourage you to give it a try. Maybe it suits you. Maybe you have a high traffic website where you could leverage pushing some of that data closer to the edge, where the users are accessing it from, in order to speed up some processes. I'd just encourage you to take a look at that.

Questions and Answers

Ignatowicz: One of the main topics is about my business logic or my infrastructure logic moving to the edge. How do I test that? How do I test my whole service when part of my code is running in a CDN provider such as Fastly? How do I do an integration test that makes sure my whole distributed system works, when it's becoming even more distributed? We're talking now about a lot of microservices running the code, but also pushing code to other companies and other providers, especially cloud code. How do I test that?

Parra: For example, in Fastly, with VCL itself it's not possible to run unit tests. That's the big disadvantage of using VCL in Fastly. However, VCL was the first language offered by the Fastly CDN for running code at the edge. Nowadays, there are modern stacks: you can run Rust, you can run JavaScript, and Golang is in beta. With all those languages, you can run simple unit tests, because it's the same idea as when you're developing the backend: you are always trying to create small functions with a single responsibility, and then you can test them. In our case, with VCL, it's not possible to run unit tests. So what do you do? We run integration tests three times per week. We created a tool to run integration tests, because it's quite expensive to try to emulate the traffic. Ideally, we would like to emulate a user. When we have an on-sale, we'd like to put, for example, 100,000 real, open browsers in the queue; we're not talking about virtual users like you get using Gatling or k6, which are load testing frameworks. We actually open the browsers and put them in the queue. Then you can see the whole mechanism: each browser receives the visitor token, waits a little bit in the queue, exchanges the visitor token for an access token, and then goes to the ticketing page.

When you try to use third party solutions, vendors that provide that kind of thing: for example, we have a contract with BlazeMeter, and BlazeMeter offers that, but it's quite expensive. We decided to build our own solution; we are using AWS Batch with play. That's a simple one. We are treating load tests like a simple, long batch job that we need to run. Every Monday, we launch 1,000 browsers that run against our staging environment. When we wake up, we receive the report to see how it went, whether it ran successfully or not. The rest of the week we have small executions only as a sanity check. The drawback is that we are running VCL, the old stack, so we don't have the kind of test that is possible to run for every pull request. We do it during the week, three times per week.

Ignatowicz: Do you dynamically determine when queues are necessary, or do those have to be set up ahead of time? For example, what happens if someone blows up, and suddenly everyone wants tickets for a concert that was released last week?

Parra: That was the proposal for the waiting room solution internally at SeatGeek. We are a ticketing company, and selling tickets is part of our core business, so we need to deal with this high traffic. When you have an on-sale, for example an on-sale next week, this on-sale was planned a month ago. The target of the virtual waiting room was to protect all the events on SeatGeek. All the events on seatgeek.com are protected by the virtual waiting room, by the Room. Vitor mentioned automation. In the beginning of the solution, we were creating the protected zones manually. That doesn't scale. We have thousands of events happening on seatgeek.com, on the platform. So now we have an extension of this solution that basically takes all the events that are published on seatgeek.com and creates protected zones with their exit rates. What's the next step of this automation? We have the planning of all the on-sales: when does the on-sale start, when is the first sale? With all the timeline information for each event, we can decide when the transition is going to be applied. For example, if the on-sale starts at 11:00, we're going to automatically blockade the path at 10:30, and then at 11:00 transition to throttle, without any manual interaction. Everything is automatic.

Ignatowicz: You do this blocking for all the events?

Parra: All the events.

See more presentations with transcripts
