MMS • Jean Yang Manuel Pais Arty Starr
Article originally posted on InfoQ.
Transcript
Kerr: I’m Jessica Kerr. I work at Honeycomb.
I have invited Arty because she wrote a book called “Idea Flow,” which is all about the development experience in the code: how can we stay in flow, and how can we reduce the number of bad surprises we get? I invited Jean because she is CEO of Akita Software, which is very much about giving us the visibility we need into the complexity that we have, rather than trying to smooth it over and pretend we don’t have it. And Manuel, because he is the co-author of “Team Topologies,” which is a most excellent book. I now have a place to point people when I want them to understand that what we need to do is structure teams so that developers have a reasonable amount of cognitive load.
The Big, Inherent Problem in Software Development
I want to ask each of you to start out with, what do you think is the big, inherent problem in software development these days? In the real world, not the FAANGs, not the VC funded startups, but the meat of the software industry, the salt of the earth. What do we keep running up against?
Pais: To me, obviously, there’s the aspect of cognitive load that we’re discussing here. I think we’ll go more in depth in there. Essentially, teams being asked to do so many things. Like you said, Jessica, and if you’re in a FAANG, or some company that has all the resources to have the super skilled people, that’s great. That’s not the normal case.
Kerr: We talked to Phillipa from Netflix, and she talked about layers of development experience teams. What are some of these things?
Pais: In my mind, the problem I see in many teams is that they’re being asked to do all these things, usually a lot of the technical aspects that they’re being asked to do, and then not think too much about the actual business side of it, the actual customers. Just, here’s the backlog, go and execute and worry about security, worry about testing, and all these things, which are obviously, super important. I don’t see the teams empowered to actually think about, are we making progress for our customers? Are we actually helping them and therefore helping the organization as well? This is not just, the management doesn’t give them the authority. It’s not just that. I think it’s both ways. I also see many teams that shy away from taking that ownership. Sometimes, it’s easy to just hide a little bit and say, “That’s for someone else to decide. It’s not up to us.” Why isn’t it up to you? You’re the people who are closest to the customers who are on the ground, and you can see the system running. You can see what the users are doing, and is this going in the right direction or not?
Kerr: There’s so many technical decisions to worry about that they don’t get to make the really crucial decision of, what should the system do?
Pais: Exactly. A lot of focus on technical aspects, I think, not enough on the actual customers.
Kerr: Jean, what do you think?
Yang: I think for me, it is heterogeneity. Here’s what I mean by this. If we look at most software systems, I have this analogy I like to make that they’re like evolving rainforests, not planned gardens. When people talk about tech stacks, especially the influencers we see online, they’re like, just use my service mesh, or just use my new framework and all of your problems will go away. The fact of the matter is a lot of people out there have decades-old legacy software, or they made a bunch of decisions in the past, and they can’t just switch everything over. For me, I came upon this view because I was working in academia on programming language design, and I was preaching, if you guys just made good choices, all of your problems would go away. Someone came up to me after one of my talks, and they said, “Jean, that’s fine. That’s like going to the doctor and the doctor saying, ‘If you had just exercised and eaten an apple a day since you were born, you wouldn’t have these problems.’”
I think a lot of people see software heterogeneity as the sum total of a series of bad choices that people made. Really, it’s like what Manuel said, it is technical meets people. People make different choices over time. People make different tools over time. The result is that people contributing to a software ecosystem are going to come up with something multilayered and heterogeneous, with many different kinds of pieces. One way to think about it is that it’s like cities. Barcelona is one of my favorite cities; it has many layers of history all together. Do you want to be Barcelona or do you want to be like some city that got frozen in time? Everyone wants to be Barcelona, and so we should give software the tools to do that.
Kerr: You’d rather be Barcelona than like Brasilia, which was planned and nobody likes it. The planned parts anyway.
Yang: It gets outdated. Yes, all plans have to evolve.
Kerr: We have this series of choices that were made that they weren’t bad choices, they were just choices. Now we have this heterogeneity, and that gets back to all the things that Manuel brought up, there’s just so many different things to know and think about.
Starr: Over the last decade or so, we’ve added an insane amount of complexity to the software domain. The sheer number of things you have to know to be even basically proficient has become enormous. I look at all this growing complexity versus what it is we’re trying to accomplish, and it’s easy to become enamored with these new technologies: we should do it this way because this is the new, cool, hip way to do things. Or idiomatic changes get made just for the sake of doing something fresh and cool and different: this will be a new, neat way to do it, so let’s update the libraries to work this way instead. Then all the people that are using those libraries have to go and rewrite all of their API connections and stuff to make that work. Then we’re on the new cool thing. Things are moving so fast, and we build on top of a tower of dependencies. As all these things are constantly shifting, we’re always trying to keep up with this complexity.
If you take a step back and look at some of the Grails that we chase, the goals that we’re reaching toward and trying to optimize for, it makes me wonder if some of these things that we’re chasing are actually counterproductive. For example, one of the Holy Grails is that we should only have to write business code, and that all of the other stuff should be part of our platform or whatever. What ends up happening in practice, though, is we manage to get rid of cyclomatic complexity through lots of dependency injection and wiring things together. We don’t have a bunch of branchy IF code in our own code anymore, but now all that complexity has shifted to the wires and shifted to the integration space. Now when something breaks, and you’re trying to troubleshoot what’s going on, you’ve got a million different factors interacting with one another. Our troubleshooting costs have exploded at an insane rate, yet we’re only looking at, “My code complexity went down. I don’t have lots of branchy IF things anymore, I’ve got high code coverage now because all this stuff went in the platform.” All of these metrics that we chase as the good things we’re aiming for show that we’re doing well. All this complexity that has exploded in other places, we don’t really have a good way to keep tabs on, so it’s become part of this invisible blind spot where all of these costs have exploded.
Kerr: Our aesthetics, we’ve applied them to the code we’re working on right now, but as a result we’ve made a bunch of things that used to be explicit, implicit and harder to see.
Starr: Those costs are very real, though. When something breaks and you diagnose things and figure out what’s going on, they still affect us every day.
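The point about complexity moving out of branchy code and into the wiring can be illustrated with a minimal Python sketch. Everything here is hypothetical (the `checkout` function, the pricing functions, and the `WIRING` table are invented for illustration): the business function has no branches at all, yet its behavior still depends on a decision that now lives in configuration.

```python
from typing import Callable

# Hypothetical business code: one straight-line function, no branches,
# so a cyclomatic-complexity metric would score it as trivially simple.
def checkout(amount_cents: int, price: Callable[[int], int]) -> int:
    return price(amount_cents)

# The decisions that used to be IF statements now live in what gets injected.
def standard_price(amount_cents: int) -> int:
    return amount_cents

def member_price(amount_cents: int) -> int:
    return amount_cents * 90 // 100  # assumed 10% member discount

# Hypothetical wiring table; in a real system this might be spread across
# config files, environment variables, and framework defaults.
WIRING = {"prod": standard_price, "members": member_price}

def run(profile: str, amount_cents: int) -> int:
    return checkout(amount_cents, WIRING[profile])
```

The same `checkout` call returns different totals depending on the profile, so someone troubleshooting a wrong total has to inspect the wiring rather than the code that the metrics describe as simple.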
Akita Software and One-Click Observability
Kerr: Does Akita address that? Is that one of the things you are addressing?
Yang: At Akita, our dream is one-click observability, so to make it as easy as possible for people to understand how the different components of their system are talking to each other. We address some of it, but not all of it. We address cross-service, cross-API complexity. We’ve decided to take an API centric view of the world. Anything that happens across APIs, we’ll tell you about, drop us in, we’ll give you an 80% solution on day one, we like to joke. We’re not a power tool. We want to make it so that anyone can install us within a few minutes, and we’ll give you something as soon as possible. We’ll tell you this is how your APIs are talking to each other. This is what your API graph looks like. This is the data types you’re using. This is what the endpoints are structured like. This is how that’s working. I totally agree with everything Arty says. That’s a big part of the motivation. For me, the way I saw it was, we can’t solve all parts of the complexity problem, so we’ll just start with the points of communication across the network. I would love to see more tools that dug in there and did all the other parts.
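As a rough illustration of what an API graph built from observed traffic might contain, here is a minimal Python sketch. It is not Akita's implementation; the traffic records, service names, and the `{id}` path-templating rule are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical observed traffic: (caller, callee, HTTP method, path).
OBSERVED = [
    ("web", "orders", "GET", "/orders/42"),
    ("web", "orders", "GET", "/orders/77"),
    ("orders", "billing", "POST", "/charges"),
    ("web", "users", "GET", "/users/9/profile"),
]

def endpoint_template(path: str) -> str:
    """Collapse numeric path segments so /orders/42 and /orders/77 match."""
    return re.sub(r"/\d+(?=/|$)", "/{id}", path)

def build_api_graph(calls):
    """Map each caller to the set of (callee, method, endpoint) it was seen using."""
    graph = defaultdict(set)
    for caller, callee, method, path in calls:
        graph[caller].add((callee, method, endpoint_template(path)))
    return dict(graph)

graph = build_api_graph(OBSERVED)
```

Even this toy version shows the idea of starting at the points of communication: nobody had to write down that `web` talks to `orders` and `users`, because that fact falls out of the calls that were seen.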
Microservices and Complexity in Connections
Kerr: Because one way that complexity gets pushed out of the code you’re looking at is when you move to microservices, and each one is nice and happy and cute, but the complexity is in the connections.
Yang: Yes. I’ve had a series of conversations with people who they’ve said their org has taken on microservices, because they think it’s a technical solution to what’s actually a human problem. It doesn’t actually make any of your problems go away, it creates more problems. For us, some of our users aren’t even super microservices based, maybe they’re on a monolith. Maybe they have a handful of services. Everything calls out to a lot of stuff these days, so you are making network calls to your datastore, your data stream, your third-party services, like everything is living in this network based ecosystem.
Kerr: Even if you think you have a monolith, it’s still a distributed system.
Yang: Absolutely.
How to Solve Complexity, From the Human Side
Kerr: How do we solve this from the human side?
Pais: Unfortunately, there’s no silver bullet. Obviously, I agree with Jean, when she’s saying microservices are not the answer. They’re not the first answer. This has a lot to do with cognitive load. We’re thinking about teams, and obviously, from Team Topologies perspective, we think of the team as the unit of delivery, not the individual. When we think in particular about software architecture, we should be thinking at the team level, what is the size of this service, or this thing that makes sense for one team to handle? That they have enough capacity to build, test, operate, but also understand, what is the actual business value of this thing, of this service? Maybe this is more like macroservice, or whatever you want to call it. Looking at the architecture at the team level, I think it’s a first step. Then maybe you’re looking to microservices as the more technical implementation, and there’s a lot of good patterns to adopt.
When you do that, you also need to be mindful of how much cognitive load this is going to put on the team, if we’re now forcing them to adopt all these patterns and take care of all these failure scenarios and all the things that make it complicated to manage and run microservices. What do you have in place in the organization to help teams address this cognitive load? Do you have maybe some enabling teams, or teams that help others learn the skills around microservices? Do you have proper platform services to support the microservice architecture? What do you have in place? You can’t just put it on the teams to now go to microservices. I see that with several clients where it’s, ok, we want to move to microservices, that’s the right thing to do. Then they don’t think about what the support system is, so that the cognitive load of the teams doesn’t explode. Teams that are used to working with a monolith are used to a different set of constraints and problems than the ones that come with microservices.
Team Cognitive Load
Kerr: Building the team size or the software that you assign to a team, usually, traditionally, we think in terms of how much work can we assign to a single team? How much work is happening in this piece of software right now? You and your book propose that we think instead about how much is there to understand? How many of these things? How much of that heterogeneity? How many layers of choices are we putting on this team?
Pais: Exactly. The team cognitive load can be or should be a key factor in several decisions. I mentioned the architecture. Also, if we want to help this team manage their cognitive load, what is the right approach? What I see often is a very limited set of options. Like Jean was mentioning, that teams or organizations think that, if you just adopt this new framework or this new tool, it’ll make your life easier. Actually, in practice, it often increases cognitive load. The other approach is, we need someone to join the team. We need an expert in this new thing. We need to worry about, for example, we don’t know how to do CI/CD or observability very well, we need to hire someone. Actually, is that going to help address cognitive load or not really? It depends. For example, with CI/CD, maybe you don’t want to dive into the details and get to now worry about all the details of your CI/CD infrastructure and tooling. Maybe you’d like to have either a platform service or you’d like to use a third party. That’s what’s going to help minimize cognitive load, not hiring someone to take care of the CI/CD. Because then we’re taking on more cognitive load by hiring this person.
Kerr: When they bring that expertise onto the team, that’s that many more things that the team as a unit of delivery needs to maintain, and collectively understand. Especially if you hire a consultant to do it, and then let them go, and then everything.
How to Know When a Team Has a High Cognitive Load
How do you know? What are some of the signs of a team being at too high a level of cognitive load, as opposed to just being too busy?
Starr: I feel like I want to start with just defining what cognitive load is, because it’s a super abstract term. I feel like that would help and give me a way to answer that question of what it looks like when we’re struggling in terms of feeling overwhelmed with cognitive load. The way I think of this is, humans have a specialized capacity for focused conscious attention. Our focused conscious attention, it works like a flashlight. You can think of having a flashlight that is focused in one spot, or you can have a flashlight that is diffused. We’ve got this flashlight, and what is our flashlight doing? It’s scanning for things that are noteworthy. Then as we take note of things, so each thing is a thought that we have to hold on to in our brain while we’re trying to accomplish something. When we do this focused attention thing and try and take note of all of these things, it doesn’t happen in a vacuum. We aren’t staring at a piece of paper, we’re usually trying to solve a problem. For example, if I write some code, and then I run a test, and I’m expecting the test to return five results and it returns no results. What happened? Now I’ve got a puzzle that I’m trying to solve. I’m confused, and I’m trying to figure out why.
Then I’m going to scan the areas of code with my flashlight, and I’m going to pick up what is noteworthy as to things that might be relevant to reasons for why the code might be returning no results. Each one of those things that might be noteworthy in the code somewhere are things I’m going to try and hold on to in my intentional memory. Our cognitive load is affected by how many of these little things we have to keep in our intentional memory in order to solve a problem. If we walk away and go do something else, and we come back, all those little things aren’t in our memory anymore. These type of dynamics where task switching, for example, going and doing something else, going to lunch and coming back, affect our ability to pay attention to these things. We have to load all of these things in our brain, all of these noteworthy things to be able to solve the puzzle at large.
There’s this dynamic of us trying to absorb lots of things in our brain, and pay attention to those things and focus on that long enough that we can solve the puzzle at hand, when there are lots of little things that we have to pay attention to, and they sprawl because of the complexity. For example, if an exception was thrown that told me, you’re getting a NullPointerException, I know exactly where to go to solve the NullPointerException, usually, because there’s a line of code to go to right there. With something like, there’s no results returned, there’s a whole bunch of different ways that that outcome can occur, and that can be really tricky to resolve. Depending on the nature of how things are failing, what observable clues we have to diagnose those things, and how much they sprawl across different systems, you can imagine all of the different sorts of things that are noteworthy that we have to pay attention to.
Things that you might see symptomatically, if a team is struggling with this, is that troubleshooting times in particular go through the roof. That’s the number one factor I would pay attention to. The tools I’m working on building raise an alarm when you hit that confusion state, when you start troubleshooting something. Then that gives you the ability to swarm on that and help one another if you need to with troubleshooting a problem. When those troubleshooting times skyrocket, usually there’s a problem with those cognitive load factors surpassing the team’s capabilities. That long, focused, conscious attention drains energy. It is work for the body, for your brain. Your glucose levels are dropping as your brain consumes all this energy trying to focus on these things. When troubleshooting skyrockets, and you’re spending lots of time hyper-focused trying to solve these problems, the team gets exhausted from the stress of that.
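The contrast between a failure that points at a line and a failure that only says "no results" can be sketched in a few lines of Python. The `Order` type, the sample data, and the helper names are hypothetical, and Python's `IndexError` stands in for the NullPointerException mentioned above.

```python
from dataclasses import dataclass

@dataclass
class Order:
    customer: str
    status: str

def find_orders(orders, customer, status):
    """Return orders matching both filters; an empty list carries no hint about why."""
    return [o for o in orders if o.customer == customer and o.status == status]

ORDERS = [Order("ada", "shipped"), Order("ada", "open")]

# Three different mistakes, one identical symptom: an empty list.
wrong_case = find_orders(ORDERS, "Ada", "shipped")    # customer name has the wrong case
wrong_status = find_orders(ORDERS, "ada", "SHIPPED")  # status value spelled differently
no_data = find_orders([], "ada", "shipped")           # the data source itself was empty

# By contrast, a loud failure names its cause and comes with a traceback.
def first_order_status(orders):
    return orders[0].status  # raises IndexError on an empty list

try:
    first_order_status([])
    loud_error = None
except IndexError as exc:
    loud_error = type(exc).__name__
```

Each silent case needs a different part of the system held in mind to diagnose (input casing, the status vocabulary, the data source), which is the sprawl of noteworthy things described here.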
Kerr: Exhaustion and stress, and in particular time spent troubleshooting are a really good sign that the team has too many things to think about and trace through.
Jean, what would you look for?
Yang: I really like everything Arty said. I think that’s also what I would look for in the people. In the technical artifacts, related to what Arty said, I would look at every time someone has to troubleshoot something, or any time someone has to figure out what’s going on: are there tools already to help with that? If people have to keep information in their own heads about, here’s my model of how the technical framework all works together, here’s my model of how the system works together, that’s going to increase troubleshooting times, that’s going to increase the load on everybody. A lot of my goal has been, let’s take some of that and make it explicit. Let’s build tools that show your team how stuff is connected. Let’s build things so that people don’t have to put the burden on themselves of documenting their API and how their API talks to other APIs. Let’s build tools that take the things that people right now either have to talk to each other in words about, or have to reconstruct in their heads by looking at logs, by looking at traces, by looking at other kinds of things, and actually do something explicit with them, automating that process. What I would look for is, are there tools that take load off of the software teams when it comes to modeling what’s going on, when it comes to automating some of the communication about what’s going on? If there aren’t, chances are, they’re overloaded. Troubleshooting takes a long time.
Kerr: You’re looking for pieces of knowledge that are like tribal knowledge. Maybe they’re partially documented, and people talk about them by word of mouth and on the whiteboard. You’re looking for that implicit knowledge, and then looking for tools that make it explicit without anyone having to document it manually?
Yang: I think that there’s this assumption that how stuff works is implicit folk knowledge. It doesn’t need to be, it just is because we don’t have things that help us map that out right now. A very simple example of something that reduces cognitive load, although this is very controversial, is types. I have many friends who will say, I can’t imagine what it’s like to work on a large JavaScript code base without TypeScript, because how do you maintain that? How do you talk to other people? This was a controversial one, but types are one example of once you make that explicit, you don’t have to talk to a human about that anymore. I also like the example of types, because once type inference got good enough, people didn’t have to sit around writing all their types all the time.
Kerr: It made it not so hard to get that explicitness.
Yang: Exactly. I think there are many other dimensions of our software systems besides the types where we could take off some of this load. Because right now, most things people think of it as, you have to ask so and so over there, if you want to know how this works. Or, let’s cross our fingers really hard and hope someone documented this, so we don’t have to dig through the code. If we can build more tooling that automatically builds the models, or automatically builds some documentation that can take a lot of load off.
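To make the types example concrete, here is a small sketch using Python type hints rather than TypeScript; the `Invoice` type and `total_cents` function are hypothetical. The idea it tries to show is that once the shape of the data is written into the code, a tool can answer questions that would otherwise go to a teammate.

```python
from dataclasses import dataclass
from typing import List, Optional, get_type_hints

@dataclass
class Invoice:
    customer_id: int
    amount_cents: int
    note: Optional[str] = None

def total_cents(invoices: List[Invoice]) -> int:
    # A type checker can infer that `paid` is an int; no annotation is needed here.
    paid = sum(inv.amount_cents for inv in invoices)
    return paid

# The explicit knowledge is machine-readable, not just human-readable.
signature = get_type_hints(total_cents)
fields = get_type_hints(Invoice)
```

The local variable carries no annotation, which loosely mirrors the remark about inference: only the boundaries are written out by hand, and the rest can be worked out by the tooling.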
Kerr: Software to investigate our software.
Manuel, what do you look for?
Pais: First, I wanted to say I really liked the flashlight description, Arty. I’m probably going to use that, if that’s ok. It’s a very visual way to think about cognitive load. Also, I just wanted to add that you have different types of cognitive load, if you go to the psychology definition. When you have that flashlight, there are different things that are playing a role, so you might have some foundational things that you need to know. If you’re working on a JavaScript code base, you need to know JavaScript, how it works, how it’s a non-typed language. If you don’t have those foundations, because you don’t have the experience, or it’s something that’s new to the team or whatever, then the flashlight is going to be pointing at that. You’re going to maybe do a lot of Googling and finding out answers. That’s one type of cognitive load, and it’s why techniques like pair programming and mob programming are so important to help with that. Because then you can get, maybe it’s that tribal knowledge you were talking about, but at least you have people together learning, and the people who are less experienced can learn from others. Often, there’s a very reductionist view on those techniques, as if they’re not so productive because you have two people working on the same task, but that’s not thinking about all the other things. If you’re actually reducing cognitive load over time, that’s actually pretty efficient. That’s one type.
Then there are other things that are distractions sometimes. To take that example where we’re running some tests, and we don’t get any results, if it’s working with a database, maybe the database was not accessible so now we will have to go and think, how do these tests run? Which database do they connect to? How do I enter the database and see if it’s running or not? Those are all distractions that are also part of the cognitive load. Now we started going off in this tangent where we’re no longer thinking about, what did we actually want to test? We’re thinking about all this other stuff. All this reduces our capacity to be thinking about actually, what did we want to do? What was the scenario we wanted to implement? What are the user personas that are going to benefit from this change we’re trying to make? What is the business context? What are the failure scenarios we need to deal with? Those three things. There’s foundational things that if you don’t have in place, like skills, that are going to take up your capacity. There’s distractions, things that you need to worry about, because you don’t have the right support, or the right practices, or what have you in place, or computers happen. That often leads to, we don’t have enough of that capacity and that focus of pointing our flashlight to the problem we’re trying to solve. At some point, we already are so far off from that.
We can ask teams, what is their experience? I wouldn’t start focusing on specific issues. Actually, a first step could be just ask the team, what can be better? Then ask the team, what is your experience to build this software? What is your experience testing? What is your experience operating, being on-call for this, troubleshooting this service? You can get a glimpse from the team and from the individuals in the team, what are the things that are difficult? What are things that are painful? One sign would be the things that they tend not to want to do very often because they’re painful. If it’s painful to deploy, then we’re probably going to want to let’s not deploy every day, because it’s so painful. Let’s do it less often.
Kerr: That’s the, if it hurts do it more thing. Because if you do it more, you’ll make it easier.
Pais: Yes, you’re forced to find ways to make this easier, so it’s not a pain. I think I heard that phrase from Jez Humble and “Continuous Delivery.” We have an example, a forum where you can ask the team, what’s the experience like of doing these things that you need to care about for evolving your software? Then over time, you can see if it’s improving or not. If we know cognitive load is too high for, let’s say, deploying, because it’s really painful, we’re going to try to do some things: maybe better tooling, maybe people following some training, or what have you. Then, let’s see, in three months or six months, has this pain gone down or not? Is the developer experience improving?
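One way to picture the "ask the team, then check again in a few months" loop is a small Python sketch that compares two rounds of answers. The 1-to-5 pain scale, the areas, the quarters, and the scores are all made up for illustration; this is not a Team Topologies instrument.

```python
from statistics import mean

# Hypothetical survey: each teammate rates pain from 1 (easy) to 5 (painful).
SURVEYS = {
    "2024-Q1": {"build": [2, 3, 2], "deploy": [5, 4, 5], "on-call": [4, 4, 3]},
    "2024-Q2": {"build": [2, 2, 2], "deploy": [3, 3, 4], "on-call": [4, 5, 4]},
}

def pain_trend(before, after):
    """Change in mean pain per area; negative means the pain went down."""
    return {area: round(mean(after[area]) - mean(before[area]), 2) for area in before}

trend = pain_trend(SURVEYS["2024-Q1"], SURVEYS["2024-Q2"])
```

With these invented numbers, deploy pain drops while on-call pain rises, which is the kind of signal that tells a team where the next round of tooling or training effort might go.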
Kerr: This is like developer experience, in particular, whether people are having a nice time of it at work, is a big clue to your socio-technical architecture, of both your teams and your software, of, is it good? Is it getting better? Not, are they having fun playing Ping-Pong, and do they like the latest flavor of LaCroix in the fridge? Do they get into flow, and do they get to stay in flow?
How Much Open Source Adds to Cognitive Load, and When It Can Reduce It
Manuel, you used a phrase, reducing cognitive load over time. I think it’s naturally as we build software, we are inherently increasing cognitive load, because we’re always adding more things, we’re always making it do more. Working to continually reduce it is so important. How much has open source added to cognitive load? When can we use it to reduce it? Who has an opinion?
Starr: I feel like we should separate lack of familiarity from the experience of cognitive load of learning something, because there’s a different dimension of dynamic when we’re unfamiliar with something. With open source, we’ve gotten to this world where 90% of our software is made of third-party parts. We don’t build things from scratch anymore, we build on top of this mountain of things that are out there and available. We rely to a certain extent on the stability of those things, because of the eyes of the community and everyone else having to deal with the friction of upgrades and things breaking, and all that. Such that those libraries and things out there, we hope will generally be stable. That when we Google for something, the answer to the error message that’s on our screen, we should find that on Stack Overflow, hopefully.
The dynamics of how we go about doing development have significantly shifted in a couple of different ways. One is how we go about searching and finding answers, not just the dependency aspect. Then, that 90% of the software that we build on top of is also largely unfamiliar. We’re building on top of abstractions, and pulling all of this unfamiliar code into our code base. I think there’s inherent risk with this direction we’re headed in. I think about when the open source movement first started, and we had this very community-oriented attitude: we’re going to care for the software ourselves, we’re all going to be part of the community and build it together, and it’s going to be great. And people really did get involved. As time went on, we’ve shifted toward an expectation that all this stuff would just be out there. As opposed to being creators or contributors, we tend to be more consumers of these things, relying on them existing and on them being someone else’s burden, not our own.
Kerr: You can’t contribute to everything.
Starr: I’m concerned about those shifts from a safety standpoint of, what’s the consequence of us building on top of a bunch of things that we really don’t know how it works?
Kerr: We can’t know how everything works.
Starr: How does our familiarity with those things drift over time, in aggregate? As a community, are we becoming less familiar with the infrastructure that we’re all dependent on?
Yang: I think it’s not just open source, but the fact that people are buying off-the-shelf infrastructure, or people are using off-the-shelf APIs. It’s everything. Everyone’s living in a software ecosystem these days. I was talking with some Google old timers the other day, and they said that back in the day, if you wanted to look at what kinds of new infrastructure were getting built, there was a small handful of companies at the cutting edge of building infrastructure pieces, and you could go there and train yourself to do it. Then you could go off and build infrastructure wherever you went next. Now, everyone buys off-the-shelf pieces for the things that they used to build by hand themselves, back when they knew how everything worked. We’re just living in a different era.
If we go back to the building analogy, it’s like, when everyone was building their own houses out of mud, you knew what mud you used, and the houses were pretty small. Now you’re getting random shipments of bricks, and all these other things from everywhere. We need to make sense of how it’s all going to fit together. I used to be very enamored with these building analogies, because there’s one guy I knew who used to give all these talks, saying, we can measure buildings, and we can talk about their structural integrity, what about code? I think code is something completely different, because the people are a part of the code. People are part of the buildings too, but it’s different. They’re not part of the structural integrity of the buildings. They are, too, but it’s still different. I do think we live in different times. I think that people are much more a part of the software processes and the code than any other analogy we can make. I don’t know what to do about that, but that’s just how it is.
Kerr: The people are part of the structural integrity of the code.