Podcast: Intentional Culture and Continuous Compensation: An Interview with Austin Vance

Shane Hastie: Good day folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today, I’m sitting down with Austin Vance. Austin is the CEO of a company called Focused Labs. Austin, welcome. Thanks for taking time to talk to us today.

Austin Vance: Yes, thanks for having me, Shane. I’m really excited to be here.

Introductions [00:49]

Shane Hastie: So we normally don’t talk to CEO folks on The InfoQ Podcast because we want to talk to technologists, but I’m told you are deeply a technologist. So who’s Austin?

Austin Vance: Yes, well I hope I’m still a technologist. I spend a ton of my time programming still. My firm is not massive and I kind of think the title of CEO is wonderful for a business card, but also probably CEO level decisions are not a big part of a midsize firm. We’re not making large strategic decisions all the time. Instead, a large part of my job is setting a bar for engineering interests and engineering consideration and excellence.

And so at one point I had gotten so far from programming, I got frustrated and I decided to live stream myself programming, started doing that about seven or eight months ago, and now I have 20,000 subscribers on YouTube where I livestream myself coding, and I grew up coding and talking through the code I’m writing so a lot of just random stuff. I’m like, “I want to experiment with this API or this new technology or do a RAG”, or something like that. So I try to code as much as I can, but also have a nice firm.

Shane Hastie: So one of the things that put us together was a conversation about developer onboarding. Why is it so hard and how can we do it better?

Challenges and opportunities in developer onboarding [02:09]

Austin Vance: A good question. I don’t know if it’s hard, might be my first reaction to that question. I think it requires intentionality and a lot of times I think when we bring people into a culture or a company, we are not intentional about how we want them to assimilate. And what happens is a company or group or a team assumes through some form of tribalism, a series of Notion docs or a wiki that someone will be able to figure out what’s going on and then the missing portions will be hit or will be covered through managerial one-on-ones. But the truth is a culture at any company, no matter the size, is the sum of all of the people that are at the company and have been at the company.

And so as soon as a new person joins, the culture shifts a little bit and it requires intentionality to drive however that person shifts the culture back towards what you want the culture to be as a leader or as someone there. And so sitting down and really understanding, first defining what you want your culture to be and what you want… And culture’s so many things, we could talk about that, but what you want your culture to be and then how you can communicate that to a person so they can be the most proficient and effective is really like that’s it. It’s really not that hard, it just requires concentration or intentionality and sometimes we just get too caught up in the day-to-day that we aren’t intentional about bringing that person in.

Shane Hastie: So what does this intentionality look like and what is the experience of being part of that intentional onboarding?

Intentionality in the onboarding experience [03:34]

Austin Vance: I’ve worked at a handful of places and I think the best I’ve ever experienced, hopefully besides my firm, but the best I’ve ever experienced was at Braintree. I ran a division of engineering at Braintree and I had a great time. I came in, I was the first of a new layer of management and they put me through the same onboarding as all the developers. So developers had some that was theirs where you talked about how do we do CI and what’s our testing culture and how do you do a pull request and feedback on that. So there was that kind of stuff. But then there’s also a bunch of onboarding about what it means to be at Braintree.

And one of the things I really loved about that is over the course of the first, if I remember correctly, it was about a month, you spent about an hour to an hour and a half, a couple times a week with a different person who’d been at the company, and they didn’t always have a big title, but a different person had been at the company for a little while talking to you about something that they loved or cared about deeply at the company.

And for engineering, like I said, it was some of the technical norms that you would expect out of the communication between teams. But then it was also what is meeting culture like, how do we communicate with each other? And then also just like what’s the history of the company? Who are we? What do we want to be? And I thought that level of intentionality and that level of time where it was different people, different opinions, all kind of speaking towards one mission and one kind of personality, which was the personality of the organization, really made you feel like you were a part of something early, early on.

And, I mean, of course I got great swag when I started and other stuff like all companies do, but I don’t think that swag made me feel as part of a new culture as onboarding did. We’ve copied that a lot. And so we over the course of a few weeks go through a handful of presentations, decks and conversations ending with a retrospective with all of our new hires on what we expect out of them, who we are, what our values are, what our operating principles are, what it means to be a developer or a salesperson or a marketer or anything like that at Focused.

Shane Hastie: You made the point that culture is the sum of everyone who is or ever has been in the organization and that you need to deliberately shift it to where you want it to go. What does that intentional culture again look like and feel like? What’s the experience of being in that culture?

Intentional culture design [05:57]

Austin Vance: I mean, the most human answer is when you’re at a place where culture is intentional, it feels good, right? It does. It just feels like things work. When you come into a place where culture doesn’t feel intentional, it can feel chaotic, it can feel misguided. When culture is intentional, what you see is you see common traits between all of your colleagues, coworkers, professionals around you. And it’s the work ethic, it’s the ethos that is the company. It’s not that you all have the same background or you all went to Harvard or you all have an MBA. That stuff doesn’t matter as much. It’s like you all approach work with the same level of rigor, the same level of thought, care, and even more so maybe the company takes the same level of care to how each other approach each other’s work.

It doesn’t mean that there’s a good culture or a bad culture. Some places might be super cutthroat, action oriented, top dog wins, eat what you kill, and that can be okay if that’s what the company wants. And if someone comes in there and they enjoy that culture, they will feel like they belong because it’s obvious what that is to them. And other cultures could be more like team oriented, collaborative, when we win, I win kind of thing, and when I win, we win. And other people might love that, but a person who’s a top tier non-collaborative performer might feel really ostracized or feel like they’re not getting the recognition they deserve. And they would maybe select out if the culture’s intentionally towards more collaborative or more the other way, a more winner takes all kind of culture or something like that.

Shane Hastie: So what is the culture that you’ve tried to instill at Focused Labs and how have you communicated that culture?

Communicating culture [07:31]

Austin Vance: It’s always evolving. Like I said, it’s a sum of all at the company and all that have been. And part of that is the people that come and go really shift the culture too. I never really liked values as a thing that companies had. I’ve worked for companies that had values and they essentially were Wi-Fi passwords, so why do we have them? Excellence or something like that. And people are probably really familiar in this podcast with the Netflix culture deck, Netflix has this great culture deck and it starts with values are not something you just put on a wall, they are the litmus test for who you hire, who you fire and who you promote. And that really hit me in a real way. And so really early in the days at Focused, I sat down with my team of, I think at the time we were five or six people and I was like, “Who are we? And then who do we want to be?”

So values can be slightly aspirational, but they can’t be so aspirational that they’re not true. And we landed on three and so our three values ended up being love your craft, listen first and learn why. A lot of times you see companies with values that are like one word, it’ll be like excellence, integrity, that kind of stuff. And we picked those three because we thought they were a little bit more action oriented. When I see love your craft, I want to work with people that love what they do. And in tech that’s so easy or maybe not so easy, but in tech we see it a lot. I’m very familiar. I came from the Ruby community and I’m really familiar with the software as a craft kind of thing, but I want my recruiters to treat talent acquisition like a craft.

I want them honing how and where and why and what they’re doing to find the best talent. And I want my sales teams to treat customer acquisition like a craft where they understand messaging and communication and empathy with all their customers and my support teams to treat support like a craft. And I want my design teams to treat design like a craft. Just every practice at my company should be treated like a craft from the top down. The leader should be a master craftsman where everybody else is learning from them. And the listen first falls out of that where in order to hone your craft, you have to be open to learning.

And it’s funny, I’m on a podcast where I spend most of the time talking and not you, but at my firm we’re consultants and I have a story actually from Braintree, it might be one of the more embarrassing professional experiences in my life where I sat down in a room really early at Braintree and I had a large organization and a big title. And I sat down in a room full of principal engineers at Braintree and I was on cloud nine because I was a big boy on the organizational chart and they were pitching moving Braintree’s APIs from REST to GraphQL for some specific things, and it was pretty early in the GraphQL days.

And I kind of was like, “Well, that’s stupid, GraphQL’s never going anywhere. No one even knows what it is”. And I kind of talked my way into sounding like an asshole. And I left the meeting and one of the principal engineers pulled me aside and told me that I talked my way into sounding like an asshole. And over the course of the next few weeks he was like, “I actually think that GraphQL’s probably the right choice. You should spend some more time learning about it. It’s your first meeting ever hearing about it. Maybe you should spend some more time figuring out what’s going on here”.

And over the course of the next few weeks, understanding how people use the Braintree APIs, understanding what was going on, understanding more depth around GraphQL, I came to believe that GraphQL is the right decision too. And I wish I had listened first. And so not that we can’t have opinions and shouldn’t, but I think one of the important decisions that we make as individuals when we communicate on teams is like, when do we listen and when do we share our opinion? And then finally learn why is at the very end of all that is this deep and entrenched curiosity and how and who we are.

And so coming back to your question, I think the most important thing about honing a culture is really knowing what you want it to be and then being able to really clearly articulate how someone embodies those traits and then constantly repeating that, that is the most important thing. And I really do believe that if you can articulate, repeat, and share that, performance follows really quickly, actually, like the results just speak for themselves.

Shane Hastie: Now, I know from our conversation before we started recording that these values are one side of it. You also then have something else that makes it more concrete.

From values to operating principles [11:52]

Austin Vance: Yes. So values, they’re still a little high level I guess. And so what we wanted to do, and this is specifically for our engineering teams, but we’ve defined a series of operating principles and these operating principles, I kind of think of them like you have the Boy Scout motto and then the things that make a Boy Scout a Boy Scout, like thrifty, kind, clean, reverent, those types of things. And our operating principles are kind of like the things, I forget what part of the Boy Scouts it is, but they’re like the things that make a Boy Scout a Boy Scout. And so we try to define a little bit more clearly what it means outside of a value, so a personality trait or what you should be doing.

And so one of my favorite values that we have is be exothermic. And so since we’re in an engineering world, I get so much feedback on this operating principle because everybody’s like, “What is exothermic?” But since we have a lot of nerds on this podcast, I think most people know what it means. But it means when given energy, you create more heat than you have or you’re putting out heat. And in a professional world, what I think that means is bring energy to the situation.

In Wedding Crashers, they say fit in by standing out. And so the way you come into a meeting is not by being quiet in the corner but come with a passion and energy and a charisma that makes other people want to care as much as you do. And that becomes infectious. And so we have these series of operating principles that are guiding principles on how everybody can interact with their customers and each other, and that’s really important.

Shane Hastie: I know that you also have some pretty unusual approaches to compensation. Can we dig into that?

Continuous compensation management [13:26]

Austin Vance: Yes, I absolutely despise the way traditional compensation is managed. The majority of my career, before I started my own firm, I worked at big companies, big, big companies or small companies that were acquired eventually by big, big companies. And so we dealt with compensation in a really traditional way. And if people aren’t familiar, the way compensation is traditionally managed is some sort of raise budget is designed for departments and the whole company.

And that gets kind of passed down through tiers of upper and middle management, distributed in some kind of finger in the wind sort of way based on performance investment and being reasonable based on a 2% raise generally for the mass set. And then on the other side of that, a manager and peers do reviews of each other and they do some form of meets expectations, exceeds expectations or falling behind. And then this manager has a raise budget and they get to dole that out to, say, their 10 or so direct reports based on meeting expectations or not, and it happens once a year. That’s the last thing. Happens once a year.

And I’ve always found that to just be an absolutely horrible way to manage performance. I have spent so much time in my life sitting in rooms with leaders at enterprises convincing them that they de-risk their software by releasing often: nightly, daily, weekly. If you want to de-risk the release of software, you release often, which means that if there’s any failure, you can fix it quickly or we can manage it quickly. The same should be true for our people. And a big way a company releases often with its people is by rewarding the people that are being excellent and proactively managing the people that are not. And the easiest way to do that is by giving or not giving raises to people. If I find out only every 12 months whether or not I’m doing well, it’s probably a pretty bad way to live and a bad way for the company to handle reward and promotion.

So the way we do it, we run a raise cycle every 6 to 12 weeks for every single person in the company. So we have managers collect feedback constantly through one-on-ones. We have peers, we do daily pair review, feedback sessions, that kind of stuff. We have well-defined job descriptions and growth pathways as you move from senior to staff engineer or junior to normal engineer or whatever. And every 6 to 12 weeks we say, “Where’s this person fit?” And if we think they’ve gotten really good, they took a leadership position over this set of features with this customer, they’re actually performing now at this level, then we give a small raise.

And that raise doesn’t have to be thousands and thousands of dollars, instead it could be a few hundred or a thousand dollars or something like that. But if you get that every six weeks, it compounds throughout the year as a good job, a good job, a good job, you’re on the right track, you’re on the right track, and you watch over the course of the year, your compensation increases dramatically, but you’ve also continued to get pushed in the right direction, “Here’s what you’re doing well”.

Also allows us to correct really quickly for anyone who’s not performing, “Hey, we’re not going to give you a raise this time because we were really seeing you disengage. You’ve been showing up late a lot, it’s been really hard for you. You’ve been cameras off and not talking to the customer. You had a feature that you were stuck on for a few weeks and you didn’t communicate that”. Whatever it is, let’s get that fixed and in six weeks they fix it and they come back, they get a raise, right?

Compensation follows so you can actually correct behavior more proactively and bring employees back who might be lost versus I’m kind of sitting in this negative cycle for at most 12 months and then being lost. A really positive way to manage people is through compensation and all the conversations that surround that compensation change. And so that’s how we do it. We continue to hone it. We tried to do it every four weeks. That was way too much overhead. We’ve found somewhere between 6 and 12 is right.
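To make the arithmetic of a short raise cycle concrete, here is a minimal sketch. It assumes flat dollar raises, as described above, and uses purely hypothetical figures: a $100,000 starting salary, a $1,000 raise per cycle, and eight six-week cycles in a year with one cycle skipped. The function name `salary_track` is illustrative and not something Focused Labs uses.

```python
def salary_track(start_salary, raises):
    """Return the salary after each raise cycle.

    raises holds one flat dollar amount per cycle; 0 means no raise that cycle.
    """
    track = []
    salary = start_salary
    for amount in raises:
        salary += amount
        track.append(salary)
    return track


# Hypothetical year: 52 weeks // 6 weeks = 8 cycles, with the third cycle skipped.
cycles_per_year = 52 // 6
raises = [1_000, 1_000, 0, 1_000, 1_000, 1_000, 1_000, 1_000]
track = salary_track(100_000, raises)

print(cycles_per_year)  # 8
print(track[-1])        # 107000
```

Under these assumed numbers the person ends the year 7% above where they started, and the skipped cycle shows up as a single flat step rather than a year-long surprise.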

Shane Hastie: Completely different to I’m sure most organizations and potentially disruptive. I wonder you’ve made the point Focused is a relatively small company at the moment. How do you think that’s going to scale?

Can this approach to compensation scale? [17:41]

Austin Vance: I think it scales phenomenally. And the reason I think it scales phenomenally, and I get this feedback a lot, but the reason I think it scales phenomenally is management at scale is the formulation of abstraction layers over complex people systems, right? The same way we create an interface to talk between services, we create interfaces, which are managers, to talk between teams doing individual and important things. And if through culture, training, onboarding, onboarding of managers, everybody understands the value of the lean compensation model or continuous compensation model, then at each level the managers really understand what’s going on and it scales horizontally fairly simply.

The harder part about it and what has been interesting through the tech cycle recently is we anchor the midpoint compensation for each of our titles in the market average, well the 60th percentile or 65th percentile of the market. And so over the course of my career, the 65th percentile of the market has only gone up until about three years ago. And so watching how the mean compensation for a staff engineer has changed was always like, well, you could stay staff engineer and your comp could continue to go up for the last 15 years, but then the last three, maybe it plateaued or maybe even a staff engineer off the street would make a little less than someone who was hired five years ago.

And so managing those conversations and understanding that has been a really interesting and really difficult part of the lean compensation, but it’s actually done really well because it helped us proactively talk about whether or not the talent that we have has stagnated or is growing. And then the people who maybe are sitting at the same level they had been for a few years, how can we push them to grow more or is that the right talent for the firm?

Shane Hastie: Another thing that I know you have opinions on is the role of management. What is the role of a manager in a tech organization?

The role of the manager [19:41]

Austin Vance: Well, hanging off of compensation, I think a lot of places I see management fit into two buckets. One is very paternal style management where you do a lot of like how are you feeling at work kind of thing and then there’s other places that do a lot of performance style management. Are you hitting your numbers and your metrics? I think management can be both. But I see often organizations lean one way or the other. The first thing I’d say is at Focused and me in general, I believe that management, in order to be an effective manager, you must control the compensation of your direct reports. It is the primary means in which you communicate performance to them.

You cannot be a good manager or you cannot be an effective manager saying, “You’re crushing it. I’m so happy, you’re doing so well, you’ve grown so much, but I’m only able to give you a one and a half percent raise this year”. Because if that’s the case, all of a sudden what you’ve done is the manager is no longer representative of the firm. They’re just a friend of the person and they created a common enemy, which is the company.

And so to me, what management is, is the person who is most responsible or solely responsible for creating predictable attrition inside of their team. And so what I mean by that is their job is to retain the best talent and understand why they are being retained and to exit their poor-performing talent proactively and understand why they are being exited. A poor-performing manager is someone who has people leave their teams and their groups without being able to predict it and or does not remove people from the team who are not high performers. And so that is it. That is a manager’s sole distilled job, but all this stuff falls out of that.

So people are like, “Well wait, so my only job is to fire and give raises to people?” No, no, no, no, no. Your job is to retain the best talent because of the right reasons and exit the correct talent for the right reasons and know when that’s happening. And so how do you do that? Well, your best talent needs new challenges. They need new opportunities. They need growth. They need compensation. They need more leadership. They need maybe a new tech stack. Your most high potential talent needs coaching. It needs one-on-ones. They need mentorship.

Your bottom-performing talent needs direct feedback, needs active management, needs performance improvement plans. Like all of the stuff that managers do centers and boils down to I want to keep or exit the right people at the right time. And so that’s how we train our managers. But all of the tools that we learn about in the manager tools, books and podcasts and all that stuff, those are all for that and we should be using all of those tools to do that. That’s my take on management at the end of the day.

Shane Hastie: I suspect that there’s not many management training classes that teach you about that.

Austin Vance: I think they hide from it. They don’t, and I think they hide from it and we take it head-on in our first management training materials. We say that and we say it’s not heartless. I was sitting around at dinner yesterday talking about this, and people are like, “When I finally have made the decision to let someone go”, the whole team is like, “Wow, why’d it take you so long?” And managers often are the last to make that decision because they don’t have the proactive conversations with each other about how to curate and cultivate the appropriate talent. And no, I don’t think they approach it directly.

Firing is one part of it, but firing is a very dramatic and last resort answer to a lot of other failed things. An interview failed, coaching failed, one-on-ones failed, performance improvement failed. But those things had to be tried and eventually it’s kind to allow someone to go someplace else or make it so they can go someplace they can be successful, because normally they’re not because the culture’s not right, not because they’re bad.

Shane Hastie: Austin, a lot of deep and interesting thoughts there. If people want to continue the conversation, where can they find you?

Austin Vance: I am pretty active on social, so you can find me, Austin BV on LinkedIn, X, all over the place, but any social with that handle is me. So reach out, please. I’d love to talk to you.

Shane Hastie: Thank you so much for taking the time to talk to us today.

Austin Vance: It was an absolute pleasure. Thanks for letting me ramble.


Fodera: I’m going to be talking about how CarGurus leverages our internal developer portal to achieve our strategic initiatives. I want to know, how many folks actually know what an internal developer portal is? An internal developer portal is really like a centralized hub that allows us to improve developer experience, developer efficiency by reducing a whole bunch of cognitive load. It’s usually internal, but the whole point is to centralize information into it.

We’ll talk about launching Showroom. Showroom is our internal developer portal that we built at CarGurus. How we actually achieved critical mass of adoption with our internal developers. Then, the foundation that we really built that helped us to set ourselves up for the future of our strategic initiatives that we’re trying to accomplish.

My name is Frank Fodera. I’m a director of engineering of developer experience. I’ve been at the company for about six years. My background is primarily in backend development, architecture, and platform engineering. Sometimes I’ll jump into the frontend as needed. I always found myself on customer facing teams, but CarGurus really helped me find my passion for improving the developer experience. That’s currently where I’m at. I do love staying technical. I found enjoyment in coaching and helping others grow, as well as achieving their strategic vision. Really like making a wide impact at the company, so I unexpectedly started gravitating towards leadership roles.

Developer Experience (DevX) – Our Mission

Before we jump into the actual tool, I want to talk a little bit about developer experience and what we’re trying to accomplish with our goals. Our mission statement was really to improve the developer experience at CarGurus by enabling team autonomy and optimizing product development. We do this in a couple of different ways. We have an architecture team that really invests into making sure that we have scalable architecture system design, and they do that for both the frontend and backend.

We have a Platform as a Service team which really invests into providing a platform offering which provides a great developer experience for the developer workflows, our environments, and really all the day-to-day that we’re doing with how we’re working. We have a pipelines and automation team which focuses on build and delivery and how we actually get everything into production, and then all the quality gates that we’re investing into in order to make sure that we do that very seamlessly. Then we have our tools team, which is really the focus of this talk, which is internal developer portal, and the internal tools that are helping us improve that developer experience at our company.

Launching Showroom

Launching Showroom. When we first started Showroom, it really was just a catalog. We knew we had a problem, but we found that it eventually evolved over time, into what we call an internal developer portal, so it’s our homegrown solution. In this presentation, we’re going to talk a little bit about what problems justified why we created this ourselves. What outcomes did those solutions provide? How did we actually achieve critical mass of getting people to use this product? Then throughout, we’ll talk about a lot of these strategic initiatives that we piggybacked off of in order to invest into this product. Then at the end, we’ll wrap it up with a little bit of a foundation that we really created that helps us to continue to invest into this, as well as leverage this to move faster on achieving those strategic initiatives.

Our journey really started in 2019. Many of the talks have talked about tech modernization, or trying to optimize the way that we’re doing stuff into the cloud, and that’s where we started with our journey. We wanted to start moving into microservices. We called it our decomposition initiative. Our monolith was starting to get slower to develop in. We wanted to actually develop more features, but we found it difficult. We knew something needed to change. We needed to transform the way that we approached our development. Thus, we embarked on our microservice journey. Our problems that we had were that we knew we were going to have hundreds of services. We already had thousands of jobs, and ownership was unknown, and some of them were even unclear.

One of the engineers on my team actually said something in our Slack, where he said, everything that has no owner is going to eventually be owned by the infrastructure. We were an infrastructure team, so it was definitely a motivation for us to make sure that we had clear ownership, because we didn’t want to end up owning everything. Production readiness was another problem we had. We didn’t always know which platform integrations were in place, or whether we were ready for production, as we were trying to introduce these new services. There really wasn’t much transparency into that. Overall, we found that it was very difficult for us to create new artifacts. It was a very heavy platform investment. It took us a lot of handholding to get it across the line. It was very time consuming. We knew that there were some big problems that we wanted to solve.

This is actually what our monolith looks like. We actually bought a tool to try to go and look at our dependency graph within our monolithic architecture. We had all of these packages, all these modules, and we had almost no restrictions on how they could actually call each other, and we ended up with a big ball of mud. We knew we were dealing with something that was pretty bad, and we knew it was a difficult challenge for us to actually go and solve this. We thought our architecture looked like this.

We thought, yes, we have this monolith which has this frontend, it has all these packages it can call, and then it relies on a database. Nice and clean. That wasn’t the reality, as we just saw. What we were targeting was actually what we call these vertical slices. These vertical slices had everything that it needed from the frontend all the way to the database to really do what it needed to do, but it really only depended on the minimum set of things. We wanted those to be more isolated. We wanted to go into that microservice architecture, and provide a more decoupled way of operating.

We also knew that we were going to have a lot of these services getting created, so we prepared for the service explosion by making it more clear who owns things, but also enforcing that we had registration and ownership. We started very basic. We started with our catalog. We started with our service catalog here. It was a JavaScript React based frontend. Developers could go in there, see what things were owned, and it pulled out a whole bunch of information for them. If they needed to contact the team, they even had a link to the channel in Slack to go and talk with them. We had an API layer that was REST based with good documentation from Swagger. It was talking to a containerized Java Spring Boot application, a service that was running in Docker and Kubernetes. Then, at the end, we actually had a MySQL database in RDS.

That really isn’t anything special. It’s pretty basic. It allowed us to catalog things, but really didn’t provide anything other than just centralizing that information. Where the big secret came in was this concept called RoadTests. RoadTests was something that we introduced into our CI system, which is Concourse, and it ran when you actually opened a PR. What this did was use one of the APIs on our cataloging system to ask: this new artifact that you’re generating in our monorepo (we actually have a monorepo, so that worked to our advantage here), is that service actually registered in our catalog? We use the concept of canonical names.

Those canonical names enforce that, if it’s not registered, we’re actually going to block your PR from getting merged in. You need to go into the catalog, register your service. We made it easy. We didn’t want to actually add too much overhead to developers, so it was a few button clicks. You could register it right there. This actually helped us to maintain and enforce that ownership as we were continuing to develop hundreds of new microservices.
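The registration check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual RoadTests code: the function names, the in-memory set standing in for the catalog API, and the `services/<canonical-name>/` folder layout are all assumptions made for the example.

```python
from pathlib import PurePosixPath

# Hypothetical stand-in for the catalog API: canonical names already registered.
REGISTERED = {"listing-search", "dealer-billing"}


def canonical_names(changed_files, root="services"):
    """Derive canonical names from changed paths.

    Assumes a services/<canonical-name>/... layout in the monorepo,
    which is an illustrative convention, not one stated in the talk.
    """
    names = set()
    for changed in changed_files:
        parts = PurePosixPath(changed).parts
        if len(parts) >= 3 and parts[0] == root:
            names.add(parts[1])
    return names


def road_test(changed_files, registered=REGISTERED):
    """Return unregistered canonical names; an empty list means the PR may merge."""
    return sorted(canonical_names(changed_files) - set(registered))


missing = road_test([
    "services/price-alerts/Main.java",
    "services/listing-search/App.java",
    "README.md",
])
if missing:
    print("Blocked: register these services in the catalog first:", missing)
```

A CI step would fail the build when the returned list is non-empty, which is what blocks the merge.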

We talked about jobs. Jobs was another issue that we had where we actually did have a whole bunch of jobs that were cataloged, but they were spread across four different instances between regions or environments. They were in this system called Rundeck, which is where we ran a lot of these batch jobs. What we did was we leveraged the Rundeck APIs, and said, we’re going to take a different approach. Instead of actually having these manually be added in as individuals were adding new ones, we already have a system that has all these, but you have to remember which ones to look into. We scheduled a nightly job. We used the Spring Batch framework within our code base, had a few APIs that pulled out the various pieces of information that we felt like it was going to be relevant for our developers.

On a nightly basis, we had a sync. It just synced all of those, put them into our catalog. Really what helped us with this is that we actually developed a way to intelligently classify our ownership. We did it with pretty good accuracy. We had about a 90% accuracy rating on how many we were actually able to classify the ownership with. Then those just ended up in our catalog. If we weren’t able to, we still have this banner at the top, which, as developers were going into the UI, they could go and say, there’s some jobs that actually don’t have ownership. Maybe I should go and classify it. Click on that link, and then they could see all the ones that don’t have owners, and they can manually claim them.
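A rough sketch of that nightly sync is shown below, in Python rather than Spring Batch. The talk does not say how ownership was classified, so the heuristic here (the longest service canonical name that prefixes the job name) is purely an assumption, as are the function names and the sample data.

```python
def sync_jobs(instances, known_services):
    """Merge jobs from several Rundeck-like instances into one catalog.

    instances: mapping of instance name -> list of job names.
    known_services: mapping of service canonical name -> owning team.
    Ownership heuristic (assumed): a job belongs to the service whose
    canonical name is the longest prefix of the job name.
    """
    catalog = {}
    for instance, jobs in instances.items():
        for job in jobs:
            matches = [name for name in known_services if job.startswith(name)]
            owner = known_services[max(matches, key=len)] if matches else None
            catalog[(instance, job)] = owner
    return catalog


def unowned(catalog):
    """Jobs the classifier could not place; these would feed the UI banner."""
    return sorted(key for key, owner in catalog.items() if owner is None)


instances = {
    "na-prod": ["dealer-billing-nightly", "cleanup-tmp"],
    "eu-prod": ["listing-search-reindex"],
}
services = {"dealer-billing": "payments", "listing-search": "search"}
catalog = sync_jobs(instances, services)
print(unowned(catalog))
```

Anything left in `unowned` is what a developer would claim manually through the banner link.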

What did we accomplish with this? We knew our problem, services and jobs were unknown. We had unclear ownership. What were the outcomes? One-hundred percent of our services and jobs were now cataloged. We had zero undefined ownership, meaning infrastructure doesn’t own everything now. Service registration was enforced and job catalog was automatically synced. This is where we really started with our first two pillars. Our first two pillars of Showroom were discoverability, the ability to have that consolidated critical information in one place, so our services and jobs catalog. Then our governance, our RoadTests. Those RoadTests allowed us to ensure that we were continuing to maintain that ownership as we were continuing to develop at a pretty rapid pace.

Achieving Critical Mass

That didn’t help us to achieve critical mass. That was all great. It’s a great foundation. If you’re not looking for information about what services or jobs or owners, you’re not really going to be enticed to go into that UI. What did we have to do in order to achieve critical mass? We focused on another problem, still within that decomposition initiative. We had manual checklists that people were going through, either Excel or wiki docs, where they’re saying, have you actually done all of these checks? Are you actually going and incorporating all of these different things before you bring your service to production? We said, we know we can do better. We use that same Spring Batch framework. We introduced something called compliance rules. These compliance rules ran on a nightly basis as well.

It would check things like, have you actually documented your APIs? If you’re in a separate repository, are you using the right repo settings? Or, if you’re in the monorepo, are you actually in the right folder to make it easy to find where your service exists? Is your pipeline configured appropriately? Are you reporting your errors in all of the different environments to our Sentry system? Are you using sane memory defaults or memory settings for Kubernetes? What’s your test coverage like? Are you actually reporting test coverage? Where are your metrics going? Do you have metrics? Are you flying blind on observability? What really made this powerful was the fact that we made it extremely pluggable, so anything that had an API, you could go and easily extend this and introduce a new rule.

That made it so anybody, our own team who’s maintaining this, and even external developers to our team, were able to introduce these as we were going through and trying to come up with things that we were considering more golden path, best practices, things that you want to make sure you’re actually designing for and incorporating before you go into production. This really helped us to focus on that standardization, so now we knew who was actually following the best practices and who were not. You got this compliance score right in this UI. What we found was our developers cared about the score. They thought of it as a little bit of a game. They wanted to get to that 100%. They wanted to get to the high green 90s in here, and that helped to bring a little bit more traffic to our product.
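To make the pluggable-rule idea concrete, here is a small sketch in Python. It is not the real compliance engine: the registry, the rule names, and the service fields (`api_docs_url`, `coverage`, `sentry_project`, `memory_limit_mb`) are hypothetical, and the score is assumed to be the simple percentage of rules that pass.

```python
from typing import Callable, Dict

# Registry of compliance rules; adding a rule is just adding a function.
RULES: Dict[str, Callable[[dict], bool]] = {}


def rule(name):
    """Decorator that registers a check under a rule name."""
    def register(fn):
        RULES[name] = fn
        return fn
    return register


@rule("api-docs")
def has_api_docs(svc):
    return bool(svc.get("api_docs_url"))


@rule("test-coverage-reported")
def reports_coverage(svc):
    return svc.get("coverage") is not None


@rule("error-reporting")
def reports_errors(svc):
    return bool(svc.get("sentry_project"))


@rule("memory-limits")
def has_memory_limits(svc):
    return bool(svc.get("memory_limit_mb"))


def compliance_score(svc):
    """Run every registered rule and return (percentage score, per-rule results)."""
    results = {name: check(svc) for name, check in RULES.items()}
    score = round(100 * sum(results.values()) / len(results))
    return score, results
```

A nightly job would call `compliance_score` for every cataloged service and store the result for display in the UI.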

What did this actually accomplish? Integrations were now transparent. Services were automatically scored. No longer did we have to keep these checklists. Production readiness was seen upfront. You didn’t actually have to do this after you were in production and check everything. You got to see this right as you were introducing the service, and the first time that this service was actually going and being run in production hardware. This was an enhancement to our governance pillar. Really making sure that as we’re developing new features into our internal developer portal, are we actually investing into things that we feel like should be there. We were like, yes, this made sense. It’s governance. It wasn’t as much of a hard enforcement as the RoadTests. You still were allowed to merge in. You still were allowed to go into production. We found that this was very useful, because our developers cared about making sure that they could observe their systems, that they had proper logs, as they were going into production.

Next, we started off with a feature called workflows, and this was all about self-service. We had that problem where it was taking very long to get our services into production, and we wanted to make that faster. What we did was we introduced this concept of workflows where it consisted of steps. Propose your service at the beginning and bring it all the way to production in an automated fashion. What we would do is use a Spring Batch framework here as well, so that way you can keep track of the progress as you’re going through all of these different steps, because there’s a lot of things to do as we’re going through here. We’d start off with collecting information. You no longer had to worry about saving your service into the catalog, because this automatically did it for you. If you needed approval from your manager, if you wanted to make sure your service canonical name was actually accurate and that you weren’t going to change that, they would check it upfront.

If you needed additional approvals, we can go and incorporate that. Then, best of all, it would notify others that this new service is being created, so it provided visibility into all of these new services. You want to move forward. Your stuff looks good. You get your approvals. We provided some templates to go and make it a little bit easier, so you didn’t actually have to worry as much as time went on about those platform integrations, because our templates provided a lot of those out of the box. If you’re using our best practices, using our templates, you get a lot of those features automatically. We cloned that template for you. We go and update those variables, set up your development environment, and say, it’s ready to start using it. Start testing it out, make sure it works as you want. Then you can move forward when you’re ready. We’re ready to move forward, so we go into our staging environment.

We automatically generate that pipeline for you. We verify that that pipeline is going to be successful, and we let it deploy into staging. We sync a whole bunch of data, and we’ll talk a little bit more about why that’s important later on. Then say, yes, go and start using staging. Make sure everything looks good. Then when you’re ready, come back and move into production. They come back and say, is your service going to be P1? P1 is priority 1, has a little bit of additional checks that you’re going to need to introduce. If so, if the person tells us, yes, we believe that my service is going to be P1, we’ll add that label for you, and it triggers off a whole bunch of other processes. We’ll verify that the production pipeline is set up. We’ll use ourselves to actually deploy your service into production, verify that it’s up and running in Kubernetes. Then if you told us, I needed a database, we’ll actually go and create a database schema for you. Then notify everyone, this brand-new service is in production.
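The step-by-step workflow with tracked progress might look something like the sketch below. The talk says Spring Batch was used for this; the Python here only illustrates the shape of a resumable workflow, and the step names and context fields are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    steps: list                                   # ordered (name, callable) pairs
    context: dict = field(default_factory=dict)   # data shared between steps
    completed: list = field(default_factory=list)

    def run(self):
        """Run the remaining steps in order.

        Stops at the first step that returns False (for example, while
        waiting for an approval) and returns its name so the workflow
        can be resumed later. Returns None once every step is done.
        """
        for name, step in self.steps:
            if name in self.completed:
                continue
            if not step(self.context):
                return name
            self.completed.append(name)
        return None


def collect_info(ctx):
    # Registering in the catalog happens as part of the workflow itself.
    ctx["catalog"] = {ctx["canonical_name"]: ctx["owner"]}
    return True


def manager_approval(ctx):
    return ctx.get("approved", False)


def clone_template(ctx):
    ctx["repo"] = f"from-template/{ctx['canonical_name']}"
    return True


wf = Workflow(
    steps=[("collect_info", collect_info),
           ("manager_approval", manager_approval),
           ("clone_template", clone_template)],
    context={"canonical_name": "price-alerts", "owner": "pricing"},
)
print(wf.run())       # pauses at the approval step
wf.context["approved"] = True
print(wf.run())       # resumes and finishes
```

Because completed steps are recorded, calling `run` again picks up where the workflow paused instead of repeating earlier work.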

What did this accomplish? We knew that it was complex to introduce new artifacts. We know it required heavy platform integration, and it was very time consuming. We brought service production time from 75 days to introduce a new service down to under 7 days. It was completely self-service: minimal handholding, no tickets. Nobody needed to depend on another team just to get your service in production, fully self-service. It was really great. Helped with developer happiness. You could innovate faster. You can introduce your services into production and get them right and rolling. This is our third pillar. We said, self-service. We wanted to make sure that our internal developer portal actually invested into team autonomy. We said what our mission was: team autonomy, productivity. This allowed for faster iteration. This was our third pillar here, self-serviceability.

Our next main initiative was our data center and cloud migrations. A bit unexpectedly, we ended up having to migrate out of our data center in 2019. However, we knew that wasn’t our long-term play. We knew we wanted to be in the cloud, so we ended up doing a lift and shift model into a new data center to get us there. Then we lifted and shifted again into getting us into AWS. That way we had that time in between 2019 and 2022, when we finally moved to AWS, to really prepare ourselves to do that. Some of our problems that we faced were, we were going to be changing host names quite often, because we were going from data center 1 to data center 2, and then eventually into the cloud.

Our developers used a lot of these bookmarks and things like that, which would help them find their services, but that was going to become stale very quickly. We knew that was going to be a problem. Deployments were very error prone, and we were now going to be deploying in twice as many regions, across multiple data centers. We were going to experience even more issues with human error and actually causing deployments to be complicated. Then, what we realized from our data center 1 to data center 2 migration was we really lacked the ability to have more dynamic configuration management, because host name changes were actually complex. We invested into that in between our second data center and our cloud migration.

First, we started off with data collection. I talked about bookmarks. Our data collection feature is essentially a centralized bookmark repository that’s visible to everybody. What we did, we leveraged that Spring Batch framework, kicked off a nightly job, or in this case, you could actually run it on demand. You click into a service, into what we call our service details page, and you would go and find that it would collect all of this information for you. Where is my pipeline found? What’s my repo, or what folder am I in within the monorepo? Are there any associated jobs that are connected to the service that I should be aware of? Where do my Sentry errors go? Where am I reporting metrics to? How do I find my logs for all of the different configurations that I have, for all of the different environments, the regions? Where can I find that? What are my host names, my internal host names that I can use to start testing?

Once again, we made this very easy to extend so really anything with an API, we can go and start collecting this information. That wasn’t all. We could collect a whole bunch of information in an automated fashion, but we also had the ability for individuals to go into this page and add some custom bookmarks. Maybe there’s a really important documentation page that we wanted to actually have. What this allowed us to do is have those developers pay it forward. Next time your teammate was looking for that critical information, or a runbook, or what happens when the service goes down, you could have a link to that actual page that goes and tells you, here’s how you can go and start triaging things. Your mind is not really all there when you’re in this emergency situation, but if you know you have the centralized place to go to, you can find all of that critical information. It was really very helpful for our developers.
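As an illustration of that centralized bookmark idea, the sketch below combines automatically collected links with custom bookmarks for a service. The collector registry, the URL patterns, and the example hosts are all made up for the example.

```python
# Registry of link collectors; each takes a service name and returns links.
COLLECTORS = []


def collector(fn):
    """Decorator that registers an automated link collector."""
    COLLECTORS.append(fn)
    return fn


@collector
def pipeline_link(svc):
    return {"pipeline": f"https://ci.example/pipelines/{svc}"}


@collector
def log_links(svc):
    return {
        f"logs ({env})": f"https://logs.example/search?service={svc}&env={env}"
        for env in ("staging", "production")
    }


def service_links(svc, custom_bookmarks=None):
    """Merge collected links with bookmarks that teammates added by hand."""
    links = {}
    for collect in COLLECTORS:
        links.update(collect(svc))
    links.update((custom_bookmarks or {}).get(svc, {}))
    return links


bookmarks = {"price-alerts": {"runbook": "https://wiki.example/runbooks/price-alerts"}}
for label, url in service_links("price-alerts", bookmarks).items():
    print(label, url)
```

Supporting a new system is then a matter of registering one more collector function.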

What did this accomplish? We knew information was quickly becoming stale. What were the outcomes that we were able to accomplish with this? We automatically collected thousands of relevant links and provided it all in one spot, relevant to the specific service that you wanted to look at. We had all these services, thousands of links, you could search, filter, find which exact one you wanted. It was extremely helpful. No longer had to bookmark things. No longer had to worry about remembering the syntax query for which log statement you were trying to find. It was all just there for you. This was our fourth pillar, transparency. Providing transparency in a single pane of glass for awareness and visibility, and data collection was our first feature of it.

Next was deployments. Deployments, we talked about a lot of human error. What we found was that when people were trying to roll back, sometimes they chose the wrong build. When people were trying to choose a build, sometimes they didn’t check if it had actually passed all of our integration tests or all of our different checks that we had, in an automated fashion. What we did was we integrated with GitHub to get the list of all the builds. You could even view your impactful commits. It got even more complicated in monorepo, because in monorepo, when you’re making a commit, it actually needed to be intelligent to know which artifacts are you impacting. We actually had a very intelligent way to determine that. Now developers could click this link, see exactly what changes they’re going to be deploying in that build, less likely to deploy something that they didn’t want to.
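The talk does not describe how impacted artifacts were determined. One simple way to approach it, shown here only as an assumed illustration, is to map each artifact to the path prefixes it builds from (its own folder plus any shared libraries) and match changed files against them. The artifact names and paths are hypothetical.

```python
# Hypothetical mapping of artifact -> path prefixes it is built from.
ARTIFACT_PATHS = {
    "listing-search": ["services/listing-search/", "libs/search-common/"],
    "dealer-billing": ["services/dealer-billing/", "libs/money/"],
}


def impacted_artifacts(changed_files, artifact_paths=ARTIFACT_PATHS):
    """Return the artifacts whose source paths include any changed file."""
    return sorted(
        artifact
        for artifact, prefixes in artifact_paths.items()
        if any(f.startswith(p) for f in changed_files for p in prefixes)
    )


print(impacted_artifacts(["libs/money/Currency.java", "docs/readme.md"]))
```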

They could very easily check, integrating with our CI system, has it actually passed all of the different checks? Alex talked a little bit about the CarGurus concept called Pit Stop Days. Pit Stop Days is where we have about one day a month where we really allow developers to brainstorm new ideas and innovate. The funny thing is this project actually started with that. We brainstormed this particular feature and said, this is a big problem. We know that we could do better eliminating this human error, so we invested into a design. We got a team together in our next hackathon that we had, we actually invested into this. We talked about the value that it was going to be providing. We talked about how it could benefit our strategic initiative that we’re investing into. It was extremely successful in the hackathon, we actually got a very functional prototype. Then we were given that buy-in as part of the strategic initiative to go and invest into this.

A developer comes in here, hits deploy. What happens under the covers? Once again, we use Spring Batch, but this time it was a little bit more complicated. We now were living in an environment where you had monolith and microservices. Depending on which one you were in, you actually would use a different system to deploy. You would deploy either through Rundeck or through Concourse. To the developer, it didn’t matter. We were able to completely abstract that away and provide the exact same developer experience to them, regardless of whether you are working on a monolith, working on a microservice, and then later on, even working in a separate repository. We provided a lot of convenience features. You wanted to see your logs, you know you were deploying a canary server for this service at this time, with this build, that log link dynamically generates it for you. You could see the logs right in the UI.

If anything was looking bad at the end of your deployment, you’d have this rollback button. It’s as simple as that. You would just click, roll back. Picks the build for you, knows exactly what build you were previously on. If everything looks good, you make sure your service is up and running in Kubernetes, you could go and proceed forward. We also had this culture at CarGurus where developers really wanted to get notified through Slack. We have a very Slack heavy culture, so we also integrated with notifications, where, as you were progressing through the various phases, we would notify you with custom notifications that we’ve made in Slack about which commits and how many commits are going out and who’s impacted, with a link back to our service to actually go and help us.
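The idea of hiding the deploy backend behind one experience, with a rollback that already knows the previous build, can be sketched like this. It is an illustration only: the classes are hypothetical, the backends just return strings instead of calling Rundeck or Concourse, and which backend served the monolith versus microservices is an assumption, since the talk does not say.

```python
from abc import ABC, abstractmethod


class Deployer(ABC):
    @abstractmethod
    def deploy(self, service, build):
        ...


class RundeckDeployer(Deployer):
    def deploy(self, service, build):
        return f"rundeck:{service}@{build}"      # placeholder for a real API call


class ConcourseDeployer(Deployer):
    def deploy(self, service, build):
        return f"concourse:{service}@{build}"    # placeholder for a real API call


class DeployPortal:
    def __init__(self, kinds):
        self.kinds = kinds   # service -> "monolith" or "microservice"
        # Assumed mapping of service kind to deploy backend.
        self.backends = {"monolith": RundeckDeployer(),
                         "microservice": ConcourseDeployer()}
        self.history = {}    # service -> builds deployed, oldest first

    def deploy(self, service, build, checks_passed):
        if not checks_passed:
            raise ValueError(f"{build} has not passed its checks")
        result = self.backends[self.kinds[service]].deploy(service, build)
        self.history.setdefault(service, []).append(build)
        return result

    def rollback(self, service):
        """Redeploy the build that was live before the current one."""
        builds = self.history.get(service, [])
        if len(builds) < 2:
            raise ValueError("no previous build to roll back to")
        builds.pop()                             # drop the bad build
        return self.backends[self.kinds[service]].deploy(service, builds[-1])
```

The developer only ever calls `deploy` and `rollback`; the choice of backend and of the previous build is made for them, which is where the human error was removed.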

This not only had people familiar with their Slack workflow, but encouraged them to come back to our UI to look at this as a visual status, rather than as a Slack notification. What did this accomplish? We actually eliminated human error during deployments almost wholesale. We found that there was almost no human error because we were able to design it away. Saved us about 7000 developer hours just in the time that it was launched. Really a huge success, and something that we’ll talk about why this was so critical to achieving critical mass.

This is our last pillar. The last pillar was operational efficiency. We really wanted to minimize fragmentation and cognitive load. We didn’t want developers to have to remember to go to all of these different regions to deploy. We really wanted to make sure that they were deploying by just clicking a button. They had the commits that were going out. They knew that upfront. They got to choose which commits were going out, rather than just blindly picking one because it happened to be the latest version, which may or may not have actually passed its checks. Then we provided really good ease for the log statement, so that way you could actually see those right in the UI.

We talked about configuration management. We knew host names were going to be changing. We knew that it wasn’t easy to actually manage these. We went through that painful process, through our first data center migration. What we did was we introduced this concept of configuration management. We introduced a service called Mach5, used primarily through a CLI. We like our car puns and naming. We had this Mach5 service which did really three things. It managed our environments for us, automated our dev deployments, and actually staging and production, so that way we maintained parity. Then it introduced the concept of configuration management, both static and dynamic.

We introduced this UI within our internal developer portal that allowed developers to go in and change their configuration for the things that were static, things that weren’t going to change across the different environments, the different regions that you were running. This was just injected right into your service as we were starting up. The Mach5 service handled that for you. Then we also had dynamic configuration. The dynamic configuration really took away the whole need to even know or care about what your host name was going to be for a particular service in an environment within a region. It just said, I’m deploying in North America. I’m deploying for my production environment. I depend on service x, and it automatically knows what the host name is for service x. Completely abstracted that away.
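Dynamic configuration of this kind comes down to resolving a host name from a region, an environment, and a dependency name. The sketch below shows that lookup in Python; it is not Mach5, and the table contents, host names, and function name are invented for illustration.

```python
# Hypothetical registry: (region, environment, service) -> internal host name.
HOSTS = {
    ("na", "production", "service-x"): "service-x.na.prod.internal.example",
    ("eu", "production", "service-x"): "service-x.eu.prod.internal.example",
    ("na", "staging", "service-x"): "service-x.na.staging.internal.example",
}


def resolve_dependencies(region, environment, depends_on, hosts=HOSTS):
    """Resolve each declared dependency to a host for this region and environment."""
    resolved = {}
    for dep in depends_on:
        key = (region, environment, dep)
        if key not in hosts:
            raise LookupError(f"no host registered for {dep} in {region}/{environment}")
        resolved[dep] = hosts[key]
    return resolved


# A service only declares where it runs and what it depends on.
print(resolve_dependencies("na", "production", ["service-x"]))
```

When a host name changes during a migration, only the registry changes; the services that depend on it declare the same thing as before.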

For our development workflows, it provided us this opportunity to make it so we had development, staging, and production all deploying in Kubernetes in a very similar way, and having environmental parity as close as we could get. We didn’t get perfect environmental parity, but as close as we could get. That allowed us to find a lot of these issues that prior to this were just, it worked in staging, it worked in development, but now it didn’t work in production. It eliminated a lot of that. Then these environments were fully managed for you, so you didn’t have to worry about it. That’s where we created the second feature here, which is the environmental visibility.

Because Mach5 was primarily CLI based, we had the ability to show visually, what services do you have deployed in your personal namespace? What services does your service depend on? You can click a button, click into your service and say, I’m experiencing an issue with my service right now. Is it actually some of my code that I wrote, or is a service that I depend on actually having an issue at the moment? Visually you could see, there’s a big red dot right there. That service is probably having an issue. Let me actually click into that service and see who owns it, so I can go and talk to them and see if their stable version that I’m depending on is not stable at the moment.

What did this accomplish for us? We had now proper configuration management enabled for all host names to be dynamic. Our static configuration was all centralized, so it was one place. We actually eliminated the pull request process. It was fully self-service as well. We launched three successful migrations, one for North America, one for EU, and then, once again, into AWS. Three successful migrations, which is a pretty huge feat. We didn’t miss anything because we had everything cataloged and we knew what we had to lift and shift. That also was huge. This is where we enhanced two more pillars. As we were going through and developing these new features, we constantly had to ask ourselves, did it align with what our vision was for this internal developer portal? Yes, these did. It provided self-service for configurations. It eliminated that manual process that we had to do to go and approve your pull requests. It provided transparency into your service, so you actually had now visibility into your environment, and you knew exactly what was running, what you were depending on, if that service was having an issue. We provided more insights to those developers.

One of the more recent initiatives that we launched was multi-repo. We primarily were operating in a monorepo, but we really wanted to move to what we called multi-repo, so multiple repositories. It was 2022, we were officially in the cloud. We had many microservices at this point, but coupling remained an issue. We were like, I thought microservices were going to solve everything for us. No, that’s not what happened. We actually did make a good amount of attempts to go and ensure that proper build and compile time was isolated, but it was still proving difficult while everyone was really operating on a single monorepo. We found that more microservices made it difficult to find real-time information. We couldn’t easily create new libraries and repositories. It was complex and time consuming.

Overall, most importantly, it was proving to be very inefficient from a build, deploy, and development perspective, to operate in a single repository. We had this architecture. In monorepo, we introduced this concept that we called embankment, which was really trying to mimic a multi-repo environment in a monorepo. It encouraged us to prevent ourselves from having dependencies on what we called mainline, which was where all of our existing artifacts had lived. It also allowed us to introduce reusable libraries that were more properly versioned. It wasn’t good enough. We wanted to move to what we call this multi-repo, where you had each artifact being produced out of one repository. Those artifacts could depend on each other. Then you could depend on a whole bunch of reusable libraries that are properly SemVered.

We had this real-time information. Services were now further spread apart from each other, and we really wanted to provide a single pane of glass for all the information that you needed. We said, let’s go and integrate with Prometheus to find how much CPU and memory your service is using. If we could find your service in Kubernetes, we’d go and tell you how many pods you have. What the status of those pods are. Are they restarting? We also wanted teams to feel empowered to improve their workflows and get more efficient. We actually implemented all four DORA metrics and provided visibility wholesale for all of the microservices, so now they could see what my change failure rate was, what my lead time for change is, and so on. We also allowed you to see, are you on the latest build?

Did somebody check in something and then nobody really remembered to deploy it, because we still were deploying manually. We also cared very deeply about security and quality, so we integrated with our security system and our quality system to go and provide that all upfront. Now you could see, do I have vulnerabilities? Do I have good coverage? Are there bugs that I have that I could go and fix? Really centralizing all of this. Now you had health, DORA metrics, build, security, quality metrics all available and easy to find, all in one place. This once again aligned with our pillars here. We had our governance, which was security analysis, allowed us to go and have that visibility more upfront, shifting further left. We had those statistics which was all about transparency, really reinforcing what we had.
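Two of the DORA metrics mentioned above, change failure rate and lead time for changes, can be computed from deployment records along these lines. The record fields (`committed_at`, `deployed_at`, `failed`) and the use of a median are assumptions for the sketch; the talk does not describe how the portal calculated them.

```python
from datetime import datetime, timedelta
from statistics import median


def change_failure_rate(deployments):
    """Fraction of deployments that led to a failure in production."""
    if not deployments:
        return 0.0
    return sum(1 for d in deployments if d["failed"]) / len(deployments)


def median_lead_time(deployments):
    """Median time from commit to running in production."""
    seconds = [
        (d["deployed_at"] - d["committed_at"]).total_seconds()
        for d in deployments
    ]
    return timedelta(seconds=median(seconds))


t0 = datetime(2024, 1, 1, 9, 0)
deployments = [
    {"committed_at": t0, "deployed_at": t0 + timedelta(hours=2), "failed": False},
    {"committed_at": t0, "deployed_at": t0 + timedelta(hours=4), "failed": True},
    {"committed_at": t0, "deployed_at": t0 + timedelta(hours=6), "failed": False},
    {"committed_at": t0, "deployed_at": t0 + timedelta(hours=8), "failed": False},
]
print(change_failure_rate(deployments), median_lead_time(deployments))
```

The other two DORA metrics, deployment frequency and time to restore service, would be derived from the same kind of records.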

Then we had a reusable workflows framework. This looks very familiar. We had steps and tasks, but what we learned this time was that we could generalize it, so we invested into refactoring this. We made it more robust, easier to extend by having these options to really introduce and ask any questions, collect any information, catalog that information, seek any approvals, execute any type of thing, notify. You could verify that things were actually set up successfully. Really, you could run any arbitrary task, and that allowed us to introduce a whole bunch of new workflows for self-service, things like creating a new library, creating a new backend service directly in multi-repo.

New application, that’s what Alex’s talk was all about. All about creating those new applications in Remix. We actually even provided an ease of use to say, during a certain period of time, we’re going to allow you to have an automated way, as automated as we could, to migrate yourself to multi-repo. Then, if you forgot to actually say that you needed a database, or you didn’t know when you were introducing the service, you can now self-service that at any point in time and create a database or create a dashboard for yourself.

This allowed us to actually further enhance that time to production for services and libraries. We went from that 7 days down to 2.5. Libraries were manually created, taking about an average of 10 days from start to publishing it. We saw that it was taking one day now to introduce a bare bones library that was published. Really, some huge successful wins. Once again, going into our discoverability pillar, really helping us to go and invest into that cataloging. We introduced library cataloging, and then actually team cataloging too. Overall, though, I do want to talk a little bit about what this multi-repo project provided from an outcomes perspective. We helped to accomplish a lot of these with this internal developer portal.

A lot of work was put in across all of the engineers at CarGurus to really invest into this tech modernization. What did we accomplish? Our lead time for changes went down by 60%. Change failure rate dropped to 5% from 25% for our monolith. Build times were 96% faster because we didn’t have to worry about that centralized pipeline that was on the monorepo. Deploy times were on average, 70% faster. Best of all, we found that our developers were actually 220% more efficient. They were happier because they were able to move faster, accomplish more, less roadblocks, less overhead, very powerful for them.

Foundation for the Future

Now we had a foundation. We had these five pillars. These five pillars really allowed us to continue to invest. You’ll see, I’ve added a few more here that I haven’t talked about in our talk, but we had a whole bunch of different features that we’ve invested into that really kept aligning with these five pillars. We’ve stayed true to that to continue to do that. We’re not just adding everything because we have a centralized portal, but because it aligns with our mission and our vision of providing that autonomy and providing that productivity boost for our developers.

I also want to talk a little bit about what we’re currently working on. We’re currently working on another initiative called time to market acceleration. The problem here is, yes, it’s 60% faster to get services into production, but it still takes days from commit to production. Quality issues are often found too late in our development cycle. Deploying to production is still manual. You still have to click a few buttons to do it. We plan on heavily leveraging our compliance rules to determine if you’re actually ready for CI/CD. We plan to leverage labels within our catalogs to track the migration of who’s moving over to this new full CI/CD model, which is the goal of this project.

Then continuing to provide a great experience by having that single pane of glass, regardless of what type of deployment model you’re in. The outcomes we predict here are getting our lead time for changes down to under 60 minutes. Maintaining a change failure rate of 5% despite moving multiple times faster. We’re hoping to lower our defect escape rate. Then have an improved developer efficiency by eliminating that manual deploy step that we have to do at the end.

Secondarily, we talked about how we lifted and shifted into the cloud from our data center. Not great, but it helped us do it very quickly and very successfully. Another initiative that we’re launching is cloud maturity. We’re operating fully in AWS, but we’re not really fully leveraging all the offerings that we have. Our services are not actually always built with the cloud-first mindset, so we can do better there. It’s actually difficult to understand the cost implications of a design decision now that we’re operating in the cloud. We plan to use our catalog to know what’s available, so you can reuse stuff instead of developing net-new. We plan to invest more into our workflows to help us self-service infrastructure provisioning, making it easier for developers, while still providing that 20% for those power users who want it.

Data collection and real-time statistics to provide cost transparency, hopefully even upfront, although we learned how difficult that might be. Then integrations with our catalog to ensure that we’re doing proper cost attribution as we’re investing into more cloud offerings. We hope to accomplish faster adoption of cloud features, more services built with a cloud-first mindset, cost transparency upfront, which hopefully should overall reduce the cost of operating in AWS. Faster time to market, by easier provisioning of that infrastructure. Then, overall, once again, our goal is always improved developer efficiency and experience.

The big question we always get, though, is, what would you have done differently? There are two things. First, find that daily feature sooner. It really wasn’t until we released that deployments feature that we achieved critical mass. Because that is a daily activity that developers had to do, it gave them a reason to go into that experience in the UI every day. My recommendation would be, find that feature that makes sense to invest into, that’s going to drive traffic in there on a daily basis. I still strongly believe that the right foundation is starting with those catalogs, because that’s how you’re going to actually know about all of those systems and provide that value. I think getting to that daily feature sooner is really important.

Secondarily, minimize the usage of team names. Teams change. What we found in our experience was that the service canonical name, which we embedded everywhere, was very unlikely to change. It actually stayed pretty consistent. Teams, on the other hand, change: a reorg happens, people change their team names, they shift under different managers, and all of a sudden you find that your infrastructure, where you’re organizing things, is out of sync with your actual catalog and system of record. I’d recommend you really lean into service canonical names and minimize your dependency on team canonical names, so that way it’s just easier. This is everything from Kubernetes to even just how you organize things in folders.
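The recommendation can be sketched as follows: infrastructure names are derived from the service canonical name alone, and team ownership is looked up in the catalog when needed. The `Catalog` class, the `svc-` prefix, and the example names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    service: str      # service canonical name, expected to stay stable
    owner_team: str   # team canonical name, expected to change over time


class Catalog:
    """Hypothetical system of record keyed by service canonical name."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, service: str, owner_team: str) -> None:
        self._entries[service] = CatalogEntry(service, owner_team)

    def reassign(self, service: str, new_team: str) -> None:
        # A reorg only touches the catalog entry, nothing else.
        self._entries[service].owner_team = new_team

    def owner_of(self, service: str) -> str:
        return self._entries[service].owner_team


def namespace_for(service: str) -> str:
    """Derive an infrastructure name from the service name only."""
    return f"svc-{service}"
```

Because `namespace_for` never sees a team name, a reorg changes what `owner_of` returns but leaves namespaces, folders, and similar identifiers untouched.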

Questions and Answers

Participant 1: The numbers and the outcomes that you’ve shown were nothing short of amazing, the testimonials too. In hindsight, everything is 20/20. What was your process to handle pushback, especially in the beginning of this process?

Fodera: Early on, we started small. We actually only had one developer working on this for, I think, all of 2019. We had some spot assistance from a frontend developer to help us. If you’re getting resistance to investment in this, how do you continue? I think that it’s really important to start small. Don’t try to sit there and say, this is a six-person team, we’re going to invest in it all out of the gate, I need a couple million dollars to do this. That’s not going to win. What we found worked was leveraging our innovation days. I talked about how we used the hackathon to prove the value of how important it is to eliminate human error.

Also, piggyback off of the initiatives that investment is already being made into, and show how you can accelerate those initiatives. Say we’re already investing into a data center migration. If you go and approach your leadership and say, I can make that a lot smoother, with a higher chance of success, or faster, by investing into this feature in parallel, that will help with getting that investment.

Participant 2: In one of the slides, you show that developer productivity increased by 220%. How do you measure that?

Fodera: We did leverage the DORA metrics pretty heavily to show, from a flow perspective, as a team, how much faster you’re working. We got a lot of developer testimonials as well, which gave us the qualitative perspective. What we found was that, given the exact same task done in the monolith or even the monorepo versus in a multi-repo service, they were able to do it about two times more quickly. That was pretty much how we leaned into it. A lot of that was eliminating what we call developer waste. We also outlined generally what it would look like working on a feature in a sprint, and how much faster you could accomplish that by removing a lot of that developer waste.

Participant 3: How do you incorporate operations into this? When I say operations, I’m talking about infrastructure, infrastructure of services, architecture. Do you incorporate any service templates or architecture templates into this developer platform that speeds up the teams?

Fodera: We have the advantage that our team is part of our platform and infrastructure team, so we sit very closely with a peer of mine who runs more of the cloud infrastructure, and I’m constantly collaborating with them. I think that collaboration helps us really go in lockstep. I think what was most important is that, with our templates, we ensured that we had all those integrations out of the box. As those templates were being created, we made sure that they worked well from an infrastructure perspective. Staying in lockstep with that peer also helped from an organizational perspective, where we were set up for success.

Participant 4: I noticed that you showed a lot of UI based tools. However, I also know that infrastructure as code is important, especially when it comes to deployments and configuration. In which cases did you use the UI or infrastructure as code, and how did you combine the two?

Fodera: The cloud maturity one is an initiative that we’re actively working on, and that is a question that actually comes up. There’s actually a great talk by another company that talks about how you want to go with that 80/20 rule. What we found, and this is still true with our developers as we’re working with them, is: talk with your customers, who are your developers in this case, and see what they want. It’s not going to be a one-size-fits-all for all companies.

What we found at CarGurus was that about 80% of people just wanted a few button clicks to go and introduce a new service or get some database or whatever, and they really didn’t want to have to worry about learning Terraform, which is what we’re using under the covers for infrastructure provisioning. Lean into that 80/20 rule: whatever 80% of your customers want, cater to that. If they want the UI, lean into that. If not, go with that approach of providing them the ability to self-service. If you have a company where everybody knows Terraform, it’s probably not worth abstracting it away with the UI.

Participant 5: A part of your journey was migration from a monolith to microservices. Can you tell us a little bit more about your journey and what went well, and lessons learned? What recommendation could you give to other people who are going through this journey right now?

Fodera: Actually, the vertical slice model that I showed didn’t work well. I have a blog post on cargurus.dev that talks about how we failed a few times in our microservice journey, specifically the journey from the monolith to microservices. The vertical slice approach didn’t work. That was trying to vertically slice and detangle that big ball of mud, and it proved to be very inefficient. That’s where we started with more of the strangler fig pattern, where we made it very easy to introduce new services instead of trying to detangle the existing ones.

Then we tried to enforce a culture where we said, as you’re introducing new features, do you actually need to introduce it into the monolith, or could you introduce it as a new service? Then we started with backend services only. That worked really well. Then we used the embankment approach that I talked about to help with the frontend services, and that helped a little bit. Then our shift to multi-repo, where we invested into a Remix template, was really that solidifying factor to help us decouple from a frontend perspective.

