Article originally posted on InfoQ.
Transcript
Losio: I'd like to say a few words about today's topic: what we mean by designing cloud applications for elasticity and resilience. Unlike the traditional approach, where we as cloud architects tend to leave resilience to the infrastructure team, this roundtable will try to cover why application architects must focus on and prioritize elasticity and resilience. We'll discuss what we mean by failure, how we handle failures, and how we handle deployment changes. Our experts are going to provide insight into not just the best practices, but also the challenges of building robust applications on the cloud.
My name is Renato Losio. I'm an editor at InfoQ. By day, I'm a cloud architect. I'm joined by four experts coming from different backgrounds, different companies, and different sectors. I'd like to give each of them a chance to introduce themselves and share their professional journey in designing applications for elasticity and resilience.
Grygleski: My name is Mary Grygleski. I'm currently the AI Practice Lead for a consulting firm based in Ohio, called Callibrity. We build custom solutions. My background, though, is in developer advocacy, particularly around streaming platforms and reactive systems. I'm also quite familiar with Lightbend and the Akka systems from when I was at IBM. I also run tech communities: I'm the president of the Chicago Java Users Group, as well as the chapter co-lead of AI Camp in Chicago.
Bonér: I’m Jonas Bonér. I started my career almost right off the bat with distributed systems. This was really before people started talking about cloud systems. It was about 20 years back. I’ve been working with clients for many years, trying to, initially, before cloud, help them to build multi-core systems and multi-node systems. Then later, when we got the cloud, actually being able to build systems for the cloud. Lately, the past few years, how to extend their applications out to the edge. I’m the founder and CTO of Lightbend. I created Akka as a tool to help initially, my clients, but then later Lightbend’s clients to build these types of applications that can make the most out of the cloud and the edge, and build systems that maximize all the amazing hardware and resources that we get in the cloud.
Fangel: My name is Thomas Bøgh Fangel. I’m a staff engineer with Lunar. Lunar is a neobank in Scandinavia. We’ve been cloud native from the very day we went live back in 2016. We started out with a monolith, but our platform has evolved into an event-driven microservice architecture, with more than 300 microservices. We’re using event sourcing as a persistence model for most of our core services. Event sourcing together with the event-driven architecture has been instrumental for us in building a resilient banking system in the cloud. Before joining Lunar, I actually also used Akka in another position for building distributed systems on on-premise deployments.
Brisals: My name is Sheen Brisals. I'm a long-term technologist. I started way back in the '90s, and have been through different technology ecosystems over the years. For the last eight or nine years, I've been working with AWS and serverless cloud. In my previous job, I was with the LEGO Group. I'm a co-author of a book called "Serverless Development on AWS", which came out earlier this year, published by O'Reilly. I'm based in London. I'm an AWS Hero, and also a Team Topologies advocate. I speak and write a lot about technology.
Designing for Resiliency and Fault Tolerance
Losio: I’d like to start really with fault tolerance in applications. First of all, let’s clarify what we mean by fault tolerance, and how can we architect and engineer proactively applications to address fault tolerance? Let’s start to get into the topic at least at a high-level point.
Bonér: It’s a topic that I’ve been exploring my whole life it feels like, at least after university. The way I see it, we shouldn’t get nitpicky on terminology, but I rather talk about resilience than fault tolerance. Resilience for me is beyond fault tolerance. It is being able to always stay responsive and always give the user some useful answer back, even if it’s, I can’t handle you right now, but at least some communication back, not just flat die. Resilience, why I like that word is that it really means being able to not just sustain failure, but also spring back into shape. That’s literally what the word means. I’ve learned the hard way that resilience/fault tolerance, it can’t be an afterthought, it can’t be bolted on afterwards.
I remember back in the day when we had application servers, WebLogic, WebSphere and such, and a lot of people thought that you could just enable a feature flag, and then you have caching or distribution, and then you're done: it's all managed by the application server for me. I really believe that creates a false sense of security. It creates brittle systems, because I see failure management and resilience as really part of the design of the application. As with everything in design, it has to be there from day one. Failure is really unavoidable, especially if you move to distributed systems in the cloud or at the edge or wherever. I think it's healthier to see failure as not something scary. It's not something that you want to prevent. Instead, try to embrace it and see it as a natural state in the application lifecycle, or in the component lifecycle.
Because, if it’s in natural state in your finite state machine, you might have started or stopped. You might have resumed, and you have failed. If you know that it’s a natural state, and it’s expected, then it’s quite easy to design for getting out of it in a graceful way. It’s of course a lot of things that tie into this. What I’ve seen a lot is how strong coupling, for example, can be blocking I/O or synchronous RPC calls or whatever, they can often lead to things like cascading failures. One component taking down the next, taking down the next, taking down the next. I think it’s extremely important to contain failure.
Losio: That’s a very interesting point that you already highlight one key message that we should take away already, is that failure is not just something that you should avoid, but it’s something that we should consider from the beginning and should think about it. I don’t know, Mary, if you share the same feeling that Jonas says about designing for failure from day one.
Grygleski: Yes, totally. I think Jonas gave a very good introduction at this stage. The fact is, we shouldn't look at failure as just something we react to, get things done, and not care about. You want to bounce back and return to the same state as if nothing bad had happened. That's amazing, and that's what I learned to appreciate, this new way of looking at failures, when I worked on reactive systems with Lightbend's Akka, back when I was with IBM. I think it is very important to think in those terms. I'm also still learning a lot.
Another aspect is, when failure occurs, what do you do? Something I find interesting is using chaos testing, for example, to build up your whole approach to handling situations when failures occur. It contributes to that as well. This can be a very interesting topic with nonstop discussion: how do we look at it?
Losio: Sheen, I don’t know if Mary brought up the topic as well with chaos testing. In general, what role do you see testing playing in building a resilient service? Is it, as well with the serverless approach, something you have to think upfront, because as someone that is not really heavily involved on a day-to-day with serverless, my first feeling is it’ll be harder to test. That’s the common message I get from some developers.
Brisals: There are different layers when it comes to resiliency or fault tolerance. It's not just the technical piece that we deal with. When it comes to testing, I'm a great fan of integration tests, especially in the current situation where we have distributed systems and event-driven architecture. Spending time doing end-to-end testing, for example, can be a waste of time, because in a distributed system, where is the beginning, where is the end? This is where you focus more on the integration testing aspect of what you have built.
Also, especially nowadays with the cloud, don't test the cloud itself, because that's taken care of by somebody else. Focus on the failure points and where the traffic hits hard. Focus around those areas and test your system accordingly. Mary also touched on how quickly you can bounce back. This is where something we call in the book the square of balance comes in: you are testing, you are observing, you are recovering. All of these should work in tandem to mitigate failures, come back alive, and provide a consistent service experience.
Losio: Thomas, as you come from a banking background, I'm sure there are scenarios where words like resilience, fault tolerance, and testing are quite high on the priority list from day one. What's your feeling about proactively designing for fault tolerance or resilience?
Fangel: I’d like to circle back to something that Jonas started talking about, namely that you have to embrace failure, so when in a cloud environment, but also due to other reasons, we certainly are embracing failure as a first principle. Whenever we have a distributed flow involving several services, we always think about, how can we design this flow in a way that embraces all the different failure scenarios that are involved in this? Since this is a natural thing for us to do when designing, it is also a natural thing to do when testing. We have to test all the different failure scenarios.
Later on, if an unexpected failure pops up in production, we also try to use it as fuel for increasing our test coverage, for actually testing new failure scenarios when they occur. I think the most important thing is to accept that failure is a normal mode of operation, instead of only focusing on the happy path. We may have a tendency to focus a lot on the happy path, because that's where the value of whatever product you're offering is, but in the cloud environment you have to embrace failure as a first-class citizen.
Embracing Resilience by Design
Losio: In my experience, it's easy to say resilience by design, that we want to think about the problem from day one. But when I think about myself and my projects, I often tend to be a bit lazy and postpone that thought. Coming more from the database side, I might say, yes, I may have to do sharding one day, and I know that doing it from day one is easier in the long term, but the upfront investment is something I'm tempted to leave for when I need it. Usually, when I do need it, the extra cost of working on it is an order of magnitude bigger. Do you have any feedback, for example for a developer, on how to start really embracing resilience by design?
Bonér: It is, of course, more work and more complexity to tackle failure initially. I like to talk about it in terms of accidental complexity versus inherent complexity. Managing failure is really part of the inherent complexity. As Thomas said, you can't avoid it. It's there whether you want it or not, and it will happen, usually when you'd like it the least. Tackling the inherent complexity of the application usually leads to less accidental complexity down the road, because when you tackle it later, you have probably gone down a path that is not where you need to be. When it comes to practical application, how to do it, that's quite a deep topic, but one thing that I think really works well is to embrace asynchronicity.
Ensuring that you don't have blocking I/O, for example, because as soon as you block on something you don't have control over, you're really sitting in that thing's lap, in a way. You're at the mercy of that other thing. It could be another component, a subsystem, some other product. That's a really bad place to be. You want to design autonomous components that are truly in charge of their own destiny.
Another important thing is making sure that you partition the application well, in terms of partitioning the dataset. That gives you the opportunity to do things like sharding. It really gives you a chance of creating these truly small, well-designed components that do one thing and do it well, and that contain logic and data. If you have that, and they are autonomous and communicate without strong coupling to the rest of the world, you have really prevented most of the things that can go wrong.
Many people talk about bulkheading, which is extremely important. The term really comes from shipbuilding, where you divide a ship into different compartments. If a leak starts in one of these compartments, that's fine. Usually two or three can start leaking and taking on water, but the rest of the ship still functions, because the failure is compartmentalized; it's contained within each of these compartments.
I think that’s extremely important. In order to create bulkheads, as I said, using an event-driven architecture is usually a really good solid starting point, because that forces you to think about the scope of each service by themselves, and how they communicate with the rest of the world, what kind of things they should be triggered by. Meaning, what events am I interested in listening to? I’m not at the mercy of someone. No one can tell me what to do. I pick up the event and do something with it at my own pace, which is extremely important as well. It creates decoupling in time and decoupling in space, which is really ground level or foundational thing for designing these types of systems.
Losio: Do you have anything else to add or pitfalls we should avoid when we start working on it?
Brisals: There are many. Just to expand on what Jonas started: he talked about compartmentalizing, and early on I said there are different layers to dealing with failures and being resilient. For example, recently on InfoQ there was an article series on cell-based architecture. It's not new; it's actually coming back into the conversation. It focuses on defining a boundary, the autonomous pieces that can stand alone and interact via loose coupling between the cells. Within the cells, as Jonas mentioned, you have single-purpose microservices. They can again stand alone, and they interact via events and APIs and things like that. The thinking itself should come at different levels. If you are an engineer or an architect, for example, when you think of architecting or designing an application, you should bring in something called systems thinking.
Think beyond what is there. If there is a third-party application being integrated, ask: what's the quota? What's the throttling limit? What do I do if it is down? Then, as already mentioned, differentiate your synchronous calls from the asynchronous parts of your application. For example, in the insurance domain, quote generation is probably critical, whereas processing an insurance claim could be non-critical compared to quote generation. There are ways you can architect accordingly, building failure handling, fault tolerance, and resiliency into these different pieces of the architecture. That's my advice to people: bring in that thinking early on as you start architecting or building solutions. Again, the paradigm is shifting more towards the engine rooms, rather than leaving these things to the infrastructure or platform side.
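Those questions about a third party (what is my time budget, and what do I do when it is down?) can be answered in code with an explicit deadline and a fallback. Below is a minimal, hypothetical Go sketch; the slow dependency is simulated with httptest, and the 200 ms budget and cached-quote fallback are assumptions for illustration, not a prescription from the panel.

```go
// A minimal sketch of calling a third party with an explicit deadline and a
// fallback, instead of hanging on an unresponsive dependency.
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Simulated third-party service that is slower than we can tolerate.
	slow := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second)
		fmt.Fprint(w, "quote: 123.45")
	}))
	defer slow.Close()

	// Budget 200ms for the call; beyond that we degrade gracefully.
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, slow.URL, nil)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Fallback path: respond with something useful rather than blocking.
		fmt.Println("third party unavailable, serving cached quote instead:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("got fresh quote")
}
```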
Unique Challenges with Building Resilient Applications
Losio: Thomas, you mentioned before the topic of non-cloud deployments. What do you see as the unique challenges of building resilient applications? We are no longer talking about just leaving it to the IT department managing the infrastructure; we are building cloud native applications. What's unique compared to a traditional infrastructure? I don't know if, in your experience, you saw such a big difference in the way you handle resilience.
Fangel: Back when I was building applications not in the cloud, we really had a pet-like relationship with all our servers. That's the difference in the way we treat servers in a cloud environment: pets versus cattle. Servers are just cattle in the cloud; you have no idea when a server is going down, and the network is inherently unreliable. Whereas in an on-premise setup you have some ops people really taking care of optimizing everything.
That, again, comes back to the fact that in the cloud you just have to accept that you have no control over when the infrastructure fails. That's a big difference in mindset when you develop: we cannot trust anything here, so we have to build our applications with the acknowledgement that failure is going to happen in any place, at any time. This mindset maybe also fosters compartmentalization, splitting your application into smaller units that you can more easily control. In that way, the nature of the environment encourages splitting your application into smaller domains where you have better control of failures. Again, it comes back to the cloud just being unreliable, and that should guide the way you actually implement your services.
Resilience at the Application Layer
Losio: You mentioned twice that we should consider the cloud unreliable, in the sense that we should always suspect that it might fail, and build the application layer on top of it assuming that everything behind it can fail. We'll come to scalability, because I feel there's some connection. I really see as a challenge the fact that, ideally, I would like to build my application and keep the cloud out of scope. I don't want to deal with IT. I don't want to deal with what's running behind it. How can you do that at the application layer? How can I really assume that everything can fail?
Grygleski: I think that is very true. Now, with systems deployed on the cloud, we're relying on infrastructural support, a managed service, for example; that's the vendor doing things for us. On the other hand, for the application itself, we probably have a new set of criteria to think about. We now have this dependency, and we have to think about it. Something could go wrong on the dependency side. How do we address, or be prepared for, that kind of dependency failure caused by things underneath on the cloud provider side? I think those are the kinds of things to consider on the application side.
I’m a bank, I’m depending on the cloud set of services to do things. It just requires a new set of thinking to handle that. I haven’t done that. I cannot speak exactly. If I’m just thinking what I should be thinking of, then it would be, I’m doing systems, applications, I need to think of all those things. Plus, to me too, we can think of also the technical side, something could go wrong. I’m also thinking scalability, for example, those kinds of situations in which, let’s say I’m doing an e-commerce system, selling flowers, and then it’s Valentine’s Day, we’re going to get a spike. We need to worry about elasticity, those things.
In terms of your application, you also have to think about the usage side and how it impacts volume. Is it going to cause stress on your system on certain days? Or maybe it's a lot of IoT devices measuring weather, and conditions change during the winter when it's really cold. Maybe on a train system a switch won't work; in the Chicago area we get minus 30 degrees Celsius, and things like that. Those are things we also have to think about, beyond just assuming the systems will fail. There are weather elements, usage patterns, and different times to consider: how do we handle certain times of day, or of the year, and so on?
Rethinking How Systems are Built
Losio: Let’s say I’m a lazy developer, that I want to build my Tomcat web application, and I have that bit old attitude to say, ok, my application scales, someone else can take care of that on the cloud, either Auto Scaling group containers, whatever. It’s not really my focus, thinking about sharding, elasticity, whatever else. I’m building my web application. What’s wrong with that approach, and how can I change that approach?
Bonér: It’s a question that I’ve been getting less recently, but a lot in the beginning of Docker and Kubernetes, that the ecosystem around cloud in general, everything that we get from the cloud providers, but also the whole ecosystem around Kubernetes is extremely rich. There are products for almost anything. I feel like, sometimes all these richness can really come at a price. That we are drowning in complexity. It can be really hard to know which of all these great products on the CNCF website, or wherever you get your tools, are the right ones for my job? How do I use them first, individually. How do I use them together, building a single, cohesive system and so on. It’s usually very hard to be able to guarantee resilience and things like elasticity and so on in a composed system. Because things usually fail at the boundaries, where you delegate data or responsibilities between different things.
I’ve learned that the infrastructure and Kubernetes and all these tools as infrastructure as well, even though they do an extremely good job, they’re just half of the story. The other half of the story is what we’ve talked about so far here, that you need to rethink how you build systems. You can’t bring over old habits, ways of thinking, best practices from the old way of building things, from the monolith. The infrastructure and the application layer need to work in concert. More, I think, can benefit from actually being brought into the application stack itself, instead of delegating it down. Because up in the application stack, you have more control as a developer, you know what’s best for your application. You know the semantics that you want to maintain. You know what kind of guarantees are very important here, but less important here. You don’t need to maintain the highest level of guarantees for everything, because that’s also very wasteful.
All of these things get easier if you bring them into the application yourself. One analogy that I've been using in some articles is that Kubernetes is really good at managing, orchestrating, and scaling empty boxes, but, of course, it matters just as much what you put inside the boxes. Just scaling and making available empty boxes doesn't get you anywhere. So also think about what you put inside, and don't just put in old things: old habits, old tools, old best practices. Try to see it as one cohesive, single system, with the layers working in concert.
Losio: I’m curious to see if there’s the same feeling from the serverless side, if you have a different approach on that, or if you actually share the same feeling about how to deal between the dichotomy of, I’m a developer, I don’t care about what’s running around as long as I run my Tomcat web app, or my lambda function that scales alone.
Brisals: That’s true. There’s a misconception in the industry that if you push everything to cloud, wonderful, everything will just magically scale or be there. That’s not the case. Because the way to look at is like, cloud providers provide set of tools, or services that you use to build, especially in the serverless ecosystem, that is mostly the case, because you have several services, you bring them together to build your application. When you do that, these different applications have different limits and capabilities. That understanding should be part of the application building process or the architecting process, so that you can build, you can identify the points of failures or points of stress, and then you can handle your application accordingly.
For example, if one application can handle thousands of requests per second but a downstream application or service can't handle that, then you have to think of buffering or mechanisms to slow things down. If something is not available, how do you retry and come back and submit? All these aspects are now more important than in the past, even though everyone uses the cloud, because, as I said, these are all sets of tools or services, and it's up to us to bring out the best from them. Also, going back to what Mary said earlier, I think, in a way, we are scaring everyone by saying that failure is everywhere: failing, failing, failing. The thing is, the reason we are talking about failure these days is that the ecosystem has changed. For example, in banking, everyone does online banking, so that means we need to make sure that the online system is there all the time. Hence we put in all this extra effort to make it happen.
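The buffering and retrying Sheen mentions can be sketched very simply. The following Go example is illustrative only: the throttled downstream is a stand-in function, and the queue size and backoff values are assumptions.

```go
// A minimal sketch of two mechanisms for a slow downstream: a buffer that
// smooths a burst from a fast producer, and a retry with backoff when the
// downstream throttles us.
package main

import (
	"errors"
	"fmt"
	"time"
)

// downstream accepts at most one request per 100ms; anything faster fails.
var lastCall time.Time

func downstream(msg string) error {
	if time.Since(lastCall) < 100*time.Millisecond {
		return errors.New("throttled")
	}
	lastCall = time.Now()
	fmt.Println("downstream processed:", msg)
	return nil
}

// sendWithRetry retries with a simple exponential backoff instead of failing
// the whole flow on the first throttling response.
func sendWithRetry(msg string, attempts int) error {
	backoff := 50 * time.Millisecond
	for i := 0; i < attempts; i++ {
		if err := downstream(msg); err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("giving up on %q after %d attempts", msg, attempts)
}

func main() {
	// The buffer decouples the producer's burst from the downstream's pace.
	queue := make(chan string, 10)
	for i := 1; i <= 5; i++ {
		queue <- fmt.Sprintf("request-%d", i) // burst arrives instantly
	}
	close(queue)

	for msg := range queue { // drained at whatever pace the downstream allows
		if err := sendWithRetry(msg, 5); err != nil {
			fmt.Println(err)
		}
	}
}
```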
The Importance of Testing (Cloud Native Tools and Services)
Losio: A lesson I found interesting there is that it's not just important to use containers and Kubernetes, but also to think about what we put inside those containers and how we develop that application. Do you actually see advancements in cloud native tools and services that are going to help in terms of resilience and elasticity? What are the challenges? As a developer, what should I be aware of? If I cannot simply take them out of the box and solve my problem, what challenges do I have to think about?
Grygleski: It is very true. I also feel that, as developers, we're mostly dealing with the functional side of things. Sometimes we think, those are non-functional concerns, not related to my business application or my goal, and we tend to push them aside. Now with the cloud we say, ok, we're paying this vendor, they should be managing all of these non-functional things for us: scalability, resilience, fault tolerance, fallback, and all of that. No. Again, this dependency actually adds another layer of complexity to applications, having to worry about what happens if my dependency goes away or isn't working.
Everybody else has talked about redesigning systems as microservices, for example, that are compartmentalized and can isolate failures. We can contain them; they're easier to manage. That is very true. We can think of scenarios and figure out how to handle them. Another good approach, as I mentioned earlier, is chaos testing, which I've only recently gotten deeper into understanding. You're trying to simulate what could happen, and it makes you aware of uncovered areas that you wouldn't have thought of. If you inject a certain kind of fault and simulate some failure situations, how would you respond? I also like to cite an example of chaos testing, because I learned about it from Netflix.
They had this Chaos Monkey testing: they simulate failure, and all of a sudden maybe the network goes down, something goes down, the whole system goes crazy, and you have to react to it. I feel that sometimes we may not immediately know all the failure points, but if we can simulate a situation and react to it, we can uncover new areas that we didn't think of because of certain unexpected things that happen. That kind of thing, I think, helps us understand our particular application better, to see how to react if something goes wrong. It goes hand in hand: we can think of all the things that can go wrong, and have testing present us with unexpected things.
I’m into AI now, not that AI is perfect, but the fact is, could we also leverage on AI to help us with this kind of approach? Thinking of some kind of scenarios that we can react to it and uncover and think of better ways to address those things. That would be my thought too.
Bonér: I can also just emphasize that testing is very hard. You should, of course, do your integration testing and your unit testing and so on. In this context, though, I think the most important testing is usually testing at the edges, at the extremes, and trying to force the system into an unhealthy state. I remember when we were developing Akka, for example, we had scripts messing around with iptables to simulate the network going down, and then seeing how the application tries to recover when we restore it, and stuff like that. Chaos testing, having Gremlins running around and messing with the system in very unpredictable ways, is usually very healthy. Load testing comes in there as well, because that can usually trigger failures, and so does variable load, making sure the system can also go back to a healthy state, consuming fewer resources when the load goes down, because these things go hand in hand. It's a whole chapter in itself.
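Fault injection like this can also live close to the code. As a small illustration, here is a hypothetical Go sketch of a client transport that randomly drops requests so the calling code can be exercised against unhealthy states; real chaos tooling (Chaos Monkey, iptables rules, and the like) injects faults at the network or infrastructure level instead.

```go
// A minimal sketch of fault injection at the client edge: a RoundTripper
// that randomly fails requests, simulating partitions and dropped connections.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
	"net/http/httptest"
)

// flakyTransport fails a configurable fraction of requests before they ever
// reach the network.
type flakyTransport struct {
	next        http.RoundTripper
	failureRate float64
}

func (f *flakyTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if rand.Float64() < f.failureRate {
		return nil, errors.New("injected network failure")
	}
	return f.next.RoundTrip(req)
}

func main() {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()

	client := &http.Client{Transport: &flakyTransport{next: http.DefaultTransport, failureRate: 0.5}}

	var failures int
	for i := 0; i < 10; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			failures++
			continue // the code under test should degrade gracefully here
		}
		resp.Body.Close()
	}
	fmt.Printf("%d of 10 calls failed; verify the system returned to a healthy state\n", failures)
}
```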
Losio: I was wondering if there's anything specific to the banking sector, Thomas. I'm pretty sure that testing edge cases is something that goes quite deep in the mobile banking scenario. Do you have any advice to share, or is anything different in terms of the mentality and approach of a developer in such a scenario?
Fangel: Yes, of course, we're in a heavily regulated industry, so we have to think carefully about whatever we do, since we're moving people's money around. One of our principles is that no matter what happens, we should always know where people's money is. If there's a money flow and we're moving money around, and an error occurs during that flow, we should still always know where the money is. That's one of the reasons we originally chose event sourcing as our persistence model: we're persisting all the events happening in the system, not the current state of something, an account, whatever it is.
Losio: Eventual consistency is not something very common in the banking scenario.
Fangel: The persistence model in itself gives you a lot of freedom for what you can do on the flip side of the event store. There's a lot of freedom there as well, but that's a completely different conversation. Of course, you have to embrace eventual consistency, but there is a lot of freedom in that choice. Coming back to your question, we try to always think about whether, in the face of errors, we will always be in a place where we know what actually happened. That's where event sourcing comes in, in our case: since we capture all the events happening, we implicitly, inherently, have in the system a log of whatever happened. When we test, yes, of course, we have to think about error scenarios and try to envision all the types of errors that can occur.
We are primarily a Go shop, so we implement everything in Go. When we started, there wasn't much tooling, like there is in the Java world, for example, or C#. We have built a lot of things ourselves, and we've actually also built a workflow engine. In that engine, whenever something unexpected happens, we can park the flow in a "failed with unknown error" state. That's one of the ways we handle errors that occur that we hadn't thought about, that aren't part of the normal flow. Something exceptional happened, but we're not treating it as an exception. We're treating the error as a value, so we can say this flow failed with this unknown error, and then alert on it. That's one of the ways we handle unforeseen error scenarios.
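To make the two ideas concrete, here is a minimal, purely illustrative Go sketch of deriving state from an event log and parking a flow as a "failed with unknown error" event when something unexpected happens. The types and the in-memory log are assumptions for the example, not Lunar's actual engine.

```go
// A minimal sketch of event sourcing plus errors-as-values: state is derived
// from the event log, and an unexpected failure becomes a parked event rather
// than an exception.
package main

import (
	"errors"
	"fmt"
)

type Event struct {
	Type   string
	Amount int
	Reason string
}

// Account state is never stored directly; it is derived from the event log,
// so after any failure we can still answer "where is the money?".
func balance(log []Event) int {
	total := 0
	for _, e := range log {
		switch e.Type {
		case "Deposited":
			total += e.Amount
		case "Withdrawn":
			total -= e.Amount
		}
	}
	return total
}

// transfer appends events; an unknown error becomes a ParkedWithUnknownError
// event, a value that can be inspected and alerted on later.
func transfer(log []Event, amount int, send func(int) error) []Event {
	log = append(log, Event{Type: "Withdrawn", Amount: amount})
	if err := send(amount); err != nil {
		return append(log, Event{Type: "ParkedWithUnknownError", Amount: amount, Reason: err.Error()})
	}
	return append(log, Event{Type: "TransferCompleted", Amount: amount})
}

func main() {
	log := []Event{{Type: "Deposited", Amount: 100}}
	log = transfer(log, 40, func(int) error { return errors.New("partner bank timed out") })

	fmt.Println("balance derived from events:", balance(log))
	fmt.Println("last event:", log[len(log)-1].Type, "-", log[len(log)-1].Reason)
}
```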
Where to Start, when Building Applications (Greenfield/Brownfield)
Losio: I’m actually putting myself in the shoes of a not so lucky software developer. In this discussion, I’ve always assumed that we can start with a white page and start designing our brand-new system and think how to work on it. I was wondering if we have some advice for a very common case, I have my web application. I have my scenario. I have something already developed by moving it to the cloud, whatever. There has been not much thinking about elasticity upfront. There’s been much thinking about, I’m going to handle resilience. Do you have any advice how to, I’m not saying incrementally, but where should I start? Should I throw everything away?
Bonér: That’s something that I’ve been seeing a lot. Of course, it’s easier when you’re greenfield, when you can just start over. That’s almost never the case. Big teams have invested 20, 40, 100-man years into a system. You still want to modernize it. I think it can be healthy to look at the system in a vertical way. Let’s say you have services, but they are strongly coupled with each other. Perhaps, let’s say, you have a monolith, even, and you would like to move to microservices in the right way, making the most out of the cloud. It can be very often useful to think about your request down to database vertically, and then try to slice off, starting by one, one service with the full request chain, all the way down to the database, and slice that off, implementing that as a microservice, or whatever you want to call it, a cloud native service.
That means having some client on the monolith side communicating with the service that is now liberated, where communication actually happens over asynchronous messaging between the rest of the monolithic services and your liberated, now cloud native, service. Try to rewrite it in a way that maximizes how it can execute in the cloud, meaning non-blocking I/O, ideally all the way down to the database, perhaps reimplementing it using event sourcing and so on, but taking it one slice at a time.
Often, you don’t need to reimplement the whole system. I would start looking for, where are my most critical services, services that absolutely can never fail? Then I just slice them off, and ensuring that I do contain failure that might happen in other parts of the system, meaning the rest of the monolith, and my service that can never fail. Or looking for services that are very heavily used, bottlenecks. It’s not always all parts of the monolith, or all services that really need to be able to scale truly elastically. It’s usually a few services that are hit a lot, either with compute or with requests or whatever. Try to focus on them and liberate them by slicing them off. Sometimes that can give you the biggest bang for the buck, in my opinion.
Losio: I was wondering if you have anything to suggest to a struggling developer dealing with an existing system and going slice by slice, maybe, as we just said.
Grygleski: That is definitely a very good approach, like Jonas was saying about slicing it and tackling them one at a time. I support that approach too.
Designing for Failure as a First Principle, in an Agile World
Losio: How do you design for failure and embrace it as a first principle in an agile world where everyone is going after the happy path? Any best practices?
Brisals: It’s not just happy paths that we need to focus. That’s what I was saying, like the systems thinking, think of all the paths, including happy and unhappy paths. This is where you begin to question, where is the failure point? Where are the critical points that I need to consider? Jonas mentioned several aspects of looking at your system and thinking of the criticality pieces and synchronous versus asynchronous. That thought process should become part of the development process, then it will become easier to handle what we are talking about here. Look at all paths, non-critical paths, especially, not just the happy paths. That also should be part of the thinking. This is where some of the techniques that we use, like domain storytelling or event storming and things like that, to unearth the points that we need to focus in order to make our systems resilient.
Can AI Be Useful in Testing?
Losio: Mary mentioned testing where services are breaking, and whether AI might be useful. Does the panel have any experience using AI to generate a list of components to test, and ways to test those components, for example, considering a single function?
Grygleski: I think there are definitely techniques we can start to think of where AI assists with this; it's not going to be in control. For example, going through a checklist of things we need to check for, we can basically have the AI go and figure out some specific ways of testing. I see it as a very viable option. I haven't done it yet. However, I'm actually starting to look into it, for example with data: you need to be able to de-identify certain sensitive types of data in a scenario.
For those kinds of things, I think we can definitely leverage AI to assist with coming up with test scenarios. That is something I see it as very capable of doing, because it leverages natural language processing, and there are also specific techniques like named entity recognition that we can make use of, without going into all the details. Certainly, yes, there are techniques you can use to help with generating test cases and testing areas too.
Choosing the Right/Best Tooling
Losio: Is too much choice a good thing or not? As a cloud architect, I'm always struggling with the idea that it was much easier in the old days, when there wasn't much choice and there wasn't much flexibility, but somehow it was easier to make decisions. David is actually mentioning the difficulty, the challenge that comes with a large set of tools. There are a lot of tool choices for almost any activity, and it's not always easy to decide. How do you decide? Is there value in following the crowd?
Bonér: It depends. Of course, having abundance is great, and having a rich ecosystem is great, but it is hard to navigate. It takes a lot of time to understand which tools are useful for my specific problem right now, and which are not. I really think it's extremely important not to just follow the crowd, but to actually understand your problem space and the problem, and use that as a filter to find the right tools. It's also not always healthy to follow the crowd, because the crowd often follows the hype.
It’s just a lot of people using something because it looks cool and they have had great marketing, or the right thought leaders are using it. Try to bring it back to first principles. What is the problem I’m working on right now? That might vary even in the same application. Try to have good judgment on picking the right five candidates, let’s say, that might potentially solve my problem. On the other hand, very popular tools are usually popular for a reason. It’s not always that it’s always being hyped. It’s a little bit of trying to use judgment here. Start with your problem and not just take a tool, because others use it. That will probably lead you down the wrong path.
Checklist/Path for Resiliency Design?
Losio: When we talk about designing for resiliency, do we have a checklist or a path we can follow, or does it always depend on the requirements? I'm thinking about a framework.
Brisals: I don’t think there is one solution for everything. It’s a combination of different things, like we echoed early on. Part of it comes from the requirement business side of things, what do we need to achieve? Then you work with the constraints of the cloud or provider or the services that you deal with, identifying those things are essential in order to architect a solution accordingly. For example, concurrency matters, API quota limits, or throttling, all these aspects come into the mixer in the modern architecture thinking, before we start building anything, because all these thinkings are upfront. That’s why I said there’s no one thing that you need to focus, there’s a collection of things that will come together from the business, from the provider, to our own technical pieces that we build.