Article originally posted on InfoQ. Visit InfoQ
Transcript
Fodera: I’m going to be talking about how CarGurus leverages our internal developer portal to achieve our strategic initiatives. I want to know, how many folks actually know what an internal developer portal is? An internal developer portal is really like a centralized hub that allows us to improve developer experience, developer efficiency by reducing a whole bunch of cognitive load. It’s usually internal, but the whole point is to centralize information into it.
We’ll talk about launching Showroom. Showroom is our internal developer portal that we built at CarGurus. How we actually achieved critical mass of adoption with our internal developers. Then, the foundation that we really built that helped us to set ourselves up for the future of our strategic initiatives that we’re trying to accomplish.
My name is Frank Fodera. I’m a director of engineering of developer experience. I’ve been at the company for about six years. My background is primarily in backend development, architecture, and platform engineering. Sometimes I’ll jump into the frontend as needed. I always found myself on customer facing teams, but CarGurus really helped me find my passion for improving the developer experience. That’s currently where I’m at. I do love staying technical. I found enjoyment in coaching and helping others grow, as well as achieving their strategic vision. Really like making a wide impact at the company, so I unexpectedly started gravitating towards leadership roles.
Developer Experience (DevX) – Our Mission
Before we jump into the actual tool, I want to talk a little bit about developer experience and what we’re trying to accomplish with our goals. Our mission statement was really to improve the developer experience at CarGurus by enabling team autonomy and optimizing product development. We do this in a couple of different ways. We have an architecture team that really invests into making sure that we have scalable architecture system design, and they do that for both the frontend and backend.
We have Platform as a Service team which really invests into providing a platform offering which provides a great developer experience for the developer workflows, our environments, and really all the day-to-day that we’re doing with how we’re working. We have a pipelines and automation team which focuses on build and delivery and how we actually get everything into production, and then all the quality gates that we’re investing into in order to make sure that we do that very seamlessly. Then we have our tools team, which is really the focus of this talk, which is internal developer portal, and the internal tools that are helping us improve that developer experience at our company.
Launching Showroom
Launching Showroom. When we first started Showroom, it really was just a catalog. We knew we had a problem, but we found that it eventually evolved over time, into what we call an internal developer portal, so it’s our homegrown solution. In this presentation, we’re going to talk a little bit about what problems justified why we created this ourselves. What outcomes did those solutions provide? How did we actually achieve critical mass of getting people to use this product? Then throughout, we’ll talk about a lot of these strategic initiatives that we piggybacked off of in order to invest into this product. Then at the end, we’ll wrap it up with a little bit of a foundation that we really created that helps us to continue to invest into this, as well as leverage this to move faster on achieving those strategic initiatives.
Our journey really started in 2019. Many of the talks have talked about tech modernization, or trying to optimize the way that we’re doing stuff into the cloud, and that’s where we started with our journey. We wanted to start moving into microservices. We called it our decomposition initiative. Our monolith was starting to get slower to develop in. We wanted to actually develop more features, but we found it difficult. We knew something needed to change. We needed to transform the way that we approached our development. Thus, we embarked on our microservice journey. Our problems that we had were that we knew we were going to have hundreds of services. We already had thousands of jobs, and ownership was unknown, and some of them were even unclear.
One of the engineers on my team actually said something in our Slack, where he said, everything that has no owner is going to eventually be owned by the infrastructure. We were an infrastructure team, so it was definitely a motivation for us to make sure that we had clear ownership, because we didn’t want to end up owning everything. Production readiness was another problem we had. We didn’t always know what platform integrations, if we were ready for production as we were trying to introduce these new services. There really wasn’t much transparency into that. Overall, we found that it was very difficult for us to create new artifacts. It was a very heavy platform investment. It took us a lot of handholding to get it across the line. It was very time consuming. We knew that there were some big problems that we wanted to solve.
This is actually what our monolith looks like. We actually bought a tool to try to go and look at our dependency graph within our monolithic architecture. We had all of these packages, all these modules, and we had almost no restrictions on how they could actually call each other, and we ended up with a big ball of mud. We knew we were dealing with something that was pretty bad, and we knew it was a difficult challenge for us to actually go and solve this. We thought our architecture looked like this.
We thought, yes, we have this monolith which has this frontend, it has all these packages it can call, and then it relies on a database. Nice and clean. That wasn’t the reality, as we just saw. What we were targeting was actually what we call these vertical slices. These vertical slices had everything that it needed from the frontend all the way to the database to really do what it needed to do, but it really only depended on the minimum set of things. We wanted those to be more isolated. We wanted to go into that microservice architecture, and provide a more decoupled way of operating.
We also knew that we were going to have a lot of these services getting created, so we prepared for the service explosion by making it more clear who owns things, but also enforcing that we had registration and ownership. We started very basic. We started with our catalog. We started with our service catalog here. It was a JavaScript React based frontend. Developers could go in there, see what things were owned, pulled out a whole bunch of information for them. They needed to contact the team, they even had a link to the channel in Slack to go and talk with them. We had an API layer that was REST based with good documentation from Swagger. It was talking to a containerized Java Spring Boot application, a service that was running in Docker and Kubernetes. Then, at the end, we actually had a MySQL database in RDS.
That really isn’t anything special. It’s pretty basic. Allowed us to catalog things, but really didn’t provide anything other than just centralizing that information. Where the big secret came in was this concept called RoadTests. RoadTests was something that we introduced into our CI system, which is Concourse, and it ran when you actually opened a PR. What this did was it used one of the APIs on our cataloging system and said, is this new artifact that you’re generating in our monorepo? We actually have a monorepo, so that worked to our advantage here. Said, is that service actually registered in our catalog? We use the concept of canonical names.
Those canonical names enforce that, if it’s not registered, we’re actually going to block your PR from getting merged in. You need to go into the catalog, register your service. We made it easy. We didn’t want to actually add too much overhead to developers, so it was a few button clicks. You could register it right there. This actually helped us to maintain and enforce that ownership as we were continuing to develop hundreds of new microservices.
We talked about jobs. Jobs was another issue that we had where we actually did have a whole bunch of jobs that were cataloged, but they were spread across four different instances between regions or environments. They were in this system called Rundeck, which is where we ran a lot of these batch jobs. What we did was we leveraged the Rundeck APIs, and said, we’re going to take a different approach. Instead of actually having these manually be added in as individuals were adding new ones, we already have a system that has all these, but you have to remember which ones to look into. We scheduled a nightly job. We used the Spring Batch framework within our code base, had a few APIs that pulled out the various pieces of information that we felt like it was going to be relevant for our developers.
On a nightly basis, we had a sync. It just synced all of those, put them into our catalog. Really what helped us with this is that we actually developed a way to intelligently classify our ownership. We did it with pretty good accuracy. We had about a 90% accuracy rating on how many we were actually able to classify the ownership with. Then those just ended up in our catalog. If we weren’t able to, we still have this banner at the top, which, as developers were going into the UI, they could go and say, there’s some jobs that actually don’t have ownership. Maybe I should go and classify it. Click on that link, and then they could see all the ones that don’t have owners, and they can manually claim them.
What did we accomplish with this? We knew our problem, services and jobs were unknown. We had unclear ownership. What were the outcomes? One-hundred percent of our services and jobs were now cataloged. We had zero undefined ownership, meaning infrastructure doesn’t own everything now. Service registration was enforced and job catalog was automatically synced. This is where we really started with our first two pillars. Our first two pillars of Showroom were discoverability, the ability to have that consolidated critical information in one place, so our services and jobs catalog. Then our governance, our RoadTests. Those RoadTests allowed us to ensure that we were continuing to maintain that ownership as we were continuing to develop at a pretty rapid pace.
Achieving Critical Mass
That didn’t help us to achieve critical mass. That was all great. It’s a great foundation. If you’re not looking for information about what services or jobs or owners, you’re not really going to be enticed to go into that UI. What did we have to do in order to achieve critical mass? We focused on another problem, still within that decomposition initiative. We had manual checklists that people were going through, either Excel or wiki docs, where they’re saying, have you actually done all of these checks? Are you actually going and incorporating all of these different things before you bring your service to production? We said, we know we can do better. We use that same Spring Batch framework. We introduced something called compliance rules. These compliance rules ran on a nightly basis as well.
It would check things like, have you actually documented your APIs? If you’re in a separate repository, are you using the right repo settings? Or, if you’re in the monorepo, are you actually in the right folder to make it easy to find where your service exists? Is your pipeline configured appropriately? Are you reporting your errors in all of the different environments to our Sentry system? Are you using sane memory defaults or memory settings for Kubernetes? What’s your test coverage like? Are you actually reporting test coverage? Where are your metrics going? Do you have metrics? Are you flying blind on observability? What really made this powerful was the fact that we made it extremely pluggable, so anything that had an API, you could go and easily extend this and introduce a new rule.
That made it so anybody, our own team who’s maintaining this, and even external developers to our team, were able to introduce these as we were going through and trying to come up with things that we were considering more golden path, best practices, things that you want to make sure you’re actually designing for and incorporating before you go into production. This really helped us to focus on that standardization, so now we knew who was actually following the best practices and who were not. You got this compliance score right in this UI. What we found was our developers cared about the score. They thought of it as a little bit of a game. They wanted to get to that 100%. They wanted to get to the high green 90s in here, and that helped to bring a little bit more traffic to our product.
What did this actually accomplish? Integrations were now transparent. Services were automatically scored. No longer did we have to keep these checklists. Production readiness was seen upfront. You didn’t actually have to do this after you were in production and check everything. You got to see this right as you were introducing the service, and the first time that this service was actually going and being run in production hardware. This was an enhancement to our governance pillar. Really making sure that as we’re developing new features into our internal developer portal, are we actually investing into things that we feel like should be there. We were like, yes, this made sense. It’s governance. It wasn’t as much of a hard enforcement as the RoadTests. You still were allowed to merge in. You still were allowed to go into production. We found that this was very useful, because our developers cared about making sure that they could observe their systems, that they had proper logs, as they were going into production.
Next, we started off with a feature called workflows, and this was all about self-service. We had that problem where it was taking very long to get our services into production, and we wanted to make that faster. What we did was we introduced this concept of workflows where it consisted of steps. Propose your service at the beginning and bring it all the way to production in an automated fashion. What we would do is use a Spring Batch framework here as well, so that way you can keep track of the progress as you’re going through all of these different steps, because there’s a lot of things to do as we’re going through here. We’d start off with collecting information. You no longer had to worry about saving your service into the catalog, because this automatically did it for you. If you needed approval from your manager, if you wanted to make sure your service canonical name was actually accurate and that you weren’t going to change that, they would check it upfront.
If you needed additional approvals, we can go and incorporate that. Then, best of all, it would notify others that this new service is being created, so it provided visibility into all of these new services. You want to move forward. Your stuff looks good. You get your approvals. We provided some templates to go and make it a little bit easier, so you didn’t actually have to worry as much as time went on about those platform integrations, because our templates provided a lot of those out of the box. If you’re using our best practices, using our templates, you get a lot of those features automatically. We cloned that template for you. We go and update those variables, set up your development environment, and say, it’s ready to start using it. Start testing it out, make sure it works as you want. Then you can move forward when you’re ready. We’re ready to move forward, so we go into our staging environment.
We automatically generate that pipeline for you. We verify that that pipeline is going to be successful, and we let it deploy into staging. We sync a whole bunch of data, and we’ll talk a little bit more about why that’s important later on. Then say, yes, go and start using staging. Make sure everything looks good. Then when you’re ready, come back and move into production. They come back and say, is your service going to be P1? P1 is priority 1, has a little bit of additional checks that you’re going to need to introduce. If so, if the person tells us, yes, we believe that my service is going to be P1, we’ll add that label for you, and it triggers off a whole bunch of other process. We’ll verify that the production pipeline is set up. We’ll use ourselves to actually deploy your service into production, verify that it’s up and running in Kubernetes. Then if you told us, I needed a database, we’ll actually go and create a database schema for you. Then notify everyone, this brand-new service is in production.
What did this accomplish? We knew that it was complex to introduce new artifacts. We know it required heavy platform integration, and it was very time consuming. We brought service production time from 75 days to introduce a new service down to under 7 days. It was completely self-service: minimal handholding, no tickets. Nobody needed to depend on another team just to get your service in production, fully self-service. It was really great. Helped with developer happiness. You could innovate faster. You can introduce your services into production and get them right and rolling. This is our third pillar. We said, self-service. We wanted to make sure that our internal developer portal actually invested into team autonomy. We said what our mission was: team autonomy, productivity. This allowed for faster iteration. This was our third pillar here, self-serviceability.
Our next main initiative was our data center and cloud migrations. A bit unexpectedly, we ended up having to migrate out of our data center in 2019. However, we knew that wasn’t our long-term play. We knew we wanted to be in the cloud, so we ended up doing a lift and shift model into a new data center to get us there. Then we lifted and shifted again into getting us into AWS. That way we had that time in between 2019 and 2022, when we finally moved to AWS, to really prepare ourselves to do that. Some of our problems that we faced were, we were going to be changing host names quite often, because we were going from data center 1 to data center 2, and then eventually into the cloud.
Our developers used a lot of these bookmarks and things like that, which would help them find their services, but that was going to become stale very quickly. We knew that was going to be a problem. Deployments were very error prone, and we were now going to be deploying in twice as many regions, across multiple data centers. We were going to experience even more issues with human error and actually causing deployments to be complicated. Then, what we realized from our data center 1 to data center 2 migration was we really lacked the ability to have more dynamic and configuration management because host name changes were actually complex. We invested into that in between our second data center to our cloud.
First, we started off with data collection. I talked about bookmarks. Our data collection feature is essentially a centralized bookmark repository that’s visible to everybody. What we did, we leveraged that Spring Batch framework, kicked off a nightly job, or in this case, you could actually run it on demand. You click into a service, into what we call our service details page, and you would go and find that it would collect all of this information for you. Where is my pipeline found? What’s my repo, or what folder am I in within the monorepo? Are there any associated jobs that are connected to the service that I should be aware of? Where do my Sentry errors go? Where am I reporting metrics to? How do I find my logs for all of the different configurations that I have, for all of the different environments, the regions? Where can I find that? What are my host names, my internal host names that I can use to start testing?
Once again, we made this very easy to extend so really anything with an API, we can go and start collecting this information. That wasn’t all. We could collect a whole bunch of information in an automated fashion, but we also had the ability for individuals to go into this page and add some custom bookmarks. Maybe there’s a really important documentation page that we wanted to actually have. What this allowed us to do is have those developers pay it forward. Next time your teammate was looking for that critical information, or a runbook, or what happens when the service goes down, you could have a link to that actual page that goes and tells you, here’s how you can go and start triaging things. Your mind is not really all there when you’re in this emergency situation, but if you know you have the centralized place to go to, you can find all of that critical information. It was really very helpful for our developers.
What did this develop? We knew information was quickly becoming stale. What were the outcomes that we were able to accomplish with this? We automatically collected thousands of relevant links and provided it all in one spot, relevant to the specific service that you wanted to look at. We had all these services, thousands of links, you could search, filter, find which exact one you wanted. It was extremely helpful. No longer had to bookmark things. No longer had to worry about remembering the syntax query for which log statement you were trying to find. It was all just there for you. This was our fourth pillar, transparency. Providing transparency in a single pane of glass for awareness and visibility, and data collection was our first feature of it.
Next was deployments. Deployments, we talked about a lot of human error. What we found was that when people were trying to roll back, sometimes they chose the wrong build. When people were trying to choose build, sometimes they didn’t check if it had actually passed all of our integration tests or all of our different checks that we had, in an automated fashion. What we did was we integrated with GitHub to get all the list of the builds. You could even view your impactful commits. It got even more complicated in monorepo, because in monorepo, when you’re making a commit, it actually needed to be intelligent to know which artifacts are you impacting. We actually had a very intelligent way to determine that. Now developers could click this link, see exactly what changes they’re going to be deploying in that build, less likely to deploy something that they didn’t want to.
They could very easily check integrating with our CI system, has it actually passed all of the different checks? Alex talked a little bit about CarGurus concept called Pit Stop Days. Pit Stop Days is where we have one day about a month, where we really allow developers to brainstorm new ideas and innovate. The funny thing is this project actually started with that. We brainstormed this particular feature and said, this is a big problem. We know that we could do better eliminating this human error, so we invested into a design. We got a team together in our next hackathon that we had, we actually invested into this. We talked about the value that it was going to be providing. We talked about how it could benefit our strategic initiative that we’re investing into. It was extremely successful in the hackathon, we actually got a very functional prototype. Then we were given that buy-in as part of the strategic initiative to go and invest into this.
A developer comes in here, hits deploy. What happens under the covers? Once again, we use Spring Batch, but this time it was a little bit more complicated. We now were living in an environment where you had monolith and microservices. Depending on which one you were in, you actually would use a different system to deploy. You would deploy either through Rundeck or deploy either through Concourse. To the developer, it didn’t matter. We were able to completely abstract that away and provide the exact same developer experience to them, regardless of whether you are working on a monolith, working on a microservice, and then later on, even working into a separate repository. We provided a lot of convenience features. You wanted to see your logs, you know you were deploying a canary server for this service at this time, with this build, that log link dynamically generates it for you. You could see the logs right in the UI.
If anything was looking bad at the end of your deployment, you’d have this rollback button. It’s as simple as that. You would just click, roll back. Picks the build for you, knows exactly what build you were previously on. If everything looks good, you make sure your service is up and running in Kubernetes, you could go and proceed forward. We also had this culture at CarGurus where developers really wanted to get notified through Slack. We have a very Slack heavy culture, so we also integrated with notifications, where, as you were progressing through the various phases, we would notify you with good notifications, custom that we’ve made in Slack about what commits, who’s impacted it, how many commits are going out, with a link back to our service to actually go and help us.
This not only had people familiar with their Slack workflow, but encouraged them to come back to our UI to look at this as a visual status, rather than as a Slack notification. What did this accomplish? We actually eliminated human error during deployments almost wholesale. We found that there was almost no human error because we were able to design it away. Saved us about 7000 developer hours just in the time that it was launched. Really a huge success, and something that we’ll talk about why this was so critical to achieving critical mass.
This is our last pillar. The last pillar was operational efficiency. We really wanted to minimize fragmentation and cognitive load. We didn’t want developers to have to remember to go to all of these different regions to deploy. We really wanted to make sure that they were deploying by just clicking a button. They had the commits that were going out. They knew that upfront. They got to choose which commits were going out, rather than just blindly picking one because it happened to be the latest version, which may or may not have actually passed its checks. Then we provided really good ease for the log statement, so that way you could actually see those right in the UI.
We talked about configuration management. We knew host names were going to be changing. We knew that it wasn’t easy to actually manage these. We went through that painful process, through our first data center migration. What we did was we introduced this concept of configuration management. We introduced a service, primarily through CLI called Mach5. We like our car puns and naming. We had this Mach5 service which did really three things. It managed our environments for us, automated our dev deployments, and actually staging and production, so that way we contained parity. Then it introduced the concept of configuration management, both static and dynamic.
We introduced this UI within our internal developer portal that allowed developers to go in and change their configuration for the things that were static, things that weren’t going to change across the different environments, the different regions that you were running. This was just injected right into your service as we were starting up. The Mach5 service handled that for you. Then we also had dynamic configuration. The dynamic configuration really took away the whole need to even know or care about what your host name was going to be for a particular service in an environment within a region. It just said, I’m deploying in North America. I’m deploying for my production environment. I depend on service x, and it automatically knows what the host name is for service x. Completely obfuscated that away.
For our development workflows, it provided us this opportunity to make it so we had development, staging, and production all deploying in Kubernetes in a very similar way, and having environmental parity as close as we could get. We didn’t get perfect environmental parity, but as close as we could get. That allowed us to find a lot of these issues that prior to this were just, it worked in staging, it worked in development, but now it didn’t work in production. It eliminated a lot of that. Then these environments were fully managed for you, so you didn’t have to worry about it. That’s where we created the second feature here, which is the environmental visibility.
Because Mach5 was primarily CLI based, we had the ability to show visually, what services do you have deployed in your personal namespace? What services does your service depend on? You can click a button, click into your service and say, I’m experiencing an issue with my service right now. Is it actually some of my code that I wrote, or is a service that I depend on actually having an issue at the moment? Visually you could see, there’s a big red dot right there. That service is probably having an issue. Let me actually click into that service and see who owns it, so I can go and talk to them and see if their stable version that I’m depending on is not stable at the moment.
What did this accomplish for us? We had now proper configuration management enabled for all host names to be dynamic. Our static configuration was all centralized, so it was one place. We actually eliminated the pull request process. It was fully self-service as well. We launched three successful migrations, one for North America, one for EU, and then, once again, into AWS. Three successful migrations, which is a pretty huge feat. We didn’t miss anything because we had everything cataloged and we knew what we had to lift and shift. That also was huge. This is where we enhanced two more, as we were going through and developing these new features, we constantly had to ask ourselves, did it align with what our vision was for this internal developer portal? Yes, these did. It provided a self-service ability to configurations. It eliminated that manual process that we had to do to go and approve your pull requests. It provided transparency into your service, so you actually had now visibility into your environment, and you knew exactly what was running, what you were depending on, if that service was having an issue. We provided more insights to those developers.
One of the more recent initiatives that we launched, and we primarily were operating in a monorepo, but we really wanted to move to what we called multi-repo, so multiple repositories. It was 2022, we were officially in the cloud. We had many microservices at this point, but coupling remained to be an issue. We were like, I thought microservices were going to solve everything for us. No, that’s not what happened. We actually did make a good amount of attempts to go and ensure that proper build and compile time was isolated, but it was still proving difficult while everyone was really operating on a single monorepo. We found that more microservices made it difficult to find real-time information. We couldn’t easily create new libraries and repositories. It was complex and time consuming.
Overall, most importantly, it was proving to be very inefficient from a build, deploy, and development perspective, to operate in a single repository. We had this architecture. In monorepo, we introduced this concept that we called embankment, which was really trying to mimic a multi-repo environment in a monorepo. It encouraged us to prevent ourselves from having dependencies on what we called mainline, which was where all of our existing artifacts had lived. It also allowed us to introduce reusable libraries that were more properly versioned. It wasn’t good enough. We wanted to move to what we call this multi-repo, where you had each artifact being produced out of one repository. Those artifacts could depend on each other. Then you could depend on a whole bunch of reusable libraries that are properly SemVered.
We had this real-time information. Services were now further spread apart from each other, and we really wanted to provide a single pane of glass for all the information that you needed. We said, let’s go and integrate with Prometheus to find how much CPU memory your service is using. If we could find your service in Kubernetes, we’d go and tell you how many pods you have. What the status of those pods are. Are they restarting? We also wanted teams to feel empowered to improve their workflows and get more efficient. We actually implemented all four DORA metrics and provided visibility wholesale for all of the microservices, so now they could see what my change failure rate was, what my lead time for change is, and so on. We also allowed you to see, are you on the latest build?
Did somebody check in something and then nobody really remembered to deploy it, because we still were deploying manually. We also cared very deeply about security and quality, so we integrated with our security system and our quality system to go and provide that all upfront. Now you could see, do I have vulnerabilities? Do I have good coverage? Are there bugs that I have that I could go and fix? Really centralizing all of this. Now you had health, DORA metrics, build, security, quality metrics all available and easy to find, all in one place. This once again aligned with our pillars here. We had our governance, which was security analysis, allowed us to go and have that visibility more upfront, shifting further left. We had those statistics which was all about transparency, really reinforcing what we had.
Then we had a reusable workflows framework. This looks very familiar. We had steps and tasks, but what we learned this time was that we could generalize it, so we invested into refactoring this. We made it more robust, easier to extend by having these options to really introduce and ask any questions, collect any information, catalog that information, seek any approvals, execute any type of thing, notify. You could verify that things were actually set up successful. Really, you could run any arbitrary task, and that allowed us to introduce a whole bunch of new workflows for self-service, things like creating a new library, creating a new backend service directly in multi-repo.
New application, that’s what Alex’s talk was all about. All about creating those new applications in Remix. We actually even provided an ease of use to say, during a certain period of time, we’re going to allow you to have an automated way, as automated as we could, to migrate yourself to multi-repo. Then, if you forgot to actually say that you needed a database, or you didn’t know when you were introducing the service, you can now self-service that at any point in time and create a database or create a dashboard for yourself.
This allowed us to actually further enhance that time to production for services and libraries. We went from that 7 days down to 2.5. Library were manually created, taking about an average of 10 days from start to publishing it. We saw that it was taking one day now to introduce a bare bones library that was published. Really, some huge successful wins. Once again, going into our discoverability pillar, really helping us to go and invest into that cataloging. We introduced library cataloging, and then actually team cataloging too. Overall, though, I do want to talk a little bit about what this multi-repo project provided from an outcomes perspective. We helped to accomplish a lot of these with this internal developer portal.
A lot of work was put in across all of the engineers at CarGurus to really invest into this tech modernization. What did we accomplish? Our lead time for changes went down by 60%. Change failure rate dropped to 5% from 25% for our monolith. Build times were 96% faster because we didn’t have to worry about that centralized pipeline that was on the monorepo. Deploy times were on average, 70% faster. Best of all, we found that our developers were actually 220% more efficient. They were happier because they were able to move faster, accomplish more, less roadblocks, less overhead, very powerful for them.
Foundation for the Future
Now we had a foundation. We had these five pillars. These five pillars really allowed us to continue to invest. You’ll see, I’ve added a few more here that I haven’t talked about in our talk, but we had a whole bunch of different features that we’ve invested into that really kept aligning with these five pillars. We’ve stayed true to that to continue to do that. We’re not just adding everything because we have a centralized portal, but because it aligns with our mission and our vision of providing that autonomy and providing that productivity boost for our developers.
I also want to talk a little bit about what we’re currently working on. We’re currently working on another initiative called time to market acceleration. The problems here is, yes, it’s 60% faster to get services into production, but it still takes days from commit to production. Quality issues are often found too late in our development cycle. Deploying to production is still manual. You still have to click a few buttons to do it. We plan on heavily leveraging our compliance rules to determine if you’re actually ready for CI/CD. We plan to leverage labels within our catalogs to track the migration of who’s moving over to this new full CI/CD model, which is the goal of this project.
Then continuing to provide a great experience by having that single pane of glass, regardless of what type of deployment model you’re in. The outcomes we predict here are getting our lead time to changes down to under 60 minutes. Maintaining a change failure rate of 5% despite moving multiple times faster. We’re hoping to lower our defect escape rate. Then have an improved developer efficiency by eliminating that manual deploy step that we have to do at the end.
Secondarily, we talked about how we lifted and shifted into the cloud from our data center. Not great, but it helped us do it very quickly and very successfully. Another initiative that we’re launching is cloud maturity. We’re operating fully in AWS, but we’re not really fully leveraging all the offerings that we have. Our services are not actually always built with the cloud-first mindset, so we can do better there. It’s actually difficult to understand the cost implications. It’s hard to understand the cost implications of a design decision now that we’re operating in the cloud. We plan to use our catalog to know what’s available, so you can reuse stuff instead of developing net-new. We plan to invest more into our workflows to help us self-service infrastructure provisioning, making it easier for developers, while still providing that 20% for those power users who want it.
Data collection and real-time statistics to provide cost transparency, hopefully even upfront, although we learned how difficult that might be. Then integrations with our catalog to ensure that we’re doing proper cost attribution as we’re investing into more cloud offerings. We hope to accomplish faster adoption of cloud features, more services built with a cloud-first mindset, cost transparency upfront, which hopefully should overall reduce the cost of operating in AWS. Faster time to market, by easier provisioning of that infrastructure. Then, overall, once again, our goal is always improved developer efficiency and experience.
The big question we always get, though, is, what would you have done differently? There’s two things, find that daily feature sooner. It really wasn’t until we released that deployments feature that we achieved critical mass. Because that is a daily activity that developers had to do, so it provided them to go into that experience in UI every day. My recommendation would be, find that feature that makes sense to invest into, that’s going to drive traffic in there on a daily basis. I still strongly believe that the right foundation is starting with those catalogs, because that’s how you’re going to actually know about all of those systems and provide that value. I think getting to that daily feature sooner is really important.
Secondarily, minimize the usage of team names. Teams change. What we found in our experience was that service canonical name, which we embedded everywhere, was very likely to not change. It actually stayed pretty consistent. Whereas teams changed, a reorg happens. People change their team names. They shift under different managers, and all of a sudden you find that your infrastructure where you’re organizing things is out of sync with your actual catalog and system of record. I’d recommend, really lean into service canonical names and minimize your dependency on team canonical names, so that way it’s just easier. This is everything from Kubernetes to even just how you organize things in folders.
Questions and Answers
Participant 1: The numbers and the outcomes that you’ve shown were nothing short of amazing, the testimonials too. In hindsight, everything is 20/20. What was your process to handle pushback, especially in the beginning of this process?
Fodera: Early on, we started little. We actually only had one developer working on this for, I think, all of 2019. We had some spot assistance from a frontend developer to help us. If you’re getting resistance from investment into this, how did we continue to do this? I think that it’s really important to start small. Don’t try to sit there and say, this is a six-person team. We’re going to invest out of it all at the gate. I need a couple million dollars to do this. That’s not going to win. What we found was, leverage our innovation days. I talked about how we used the hackathon to prove the value of how important it is to eliminate human error.
Also, piggybacking off of the initiatives that the investment’s already being made into, and showing how you can accelerate those initiatives. If you can show that we’re already investing into a data center migration. If you go and approach your leadership and say, I can make that a lot smoother, higher chance of success or faster by investing into this feature in parallel, you’ll help with getting that investment.
Participant 2: In one of the slides, you show that developer productivity increased by 220%. How do you measure that?
Fodera: We did leverage the DORA metrics pretty heavily to show, from a flow perspective, as a team, how much faster you’re working. We got a lot of developer testimonials as well, which, from a qualitative perspective, would allow us to do that. What we found was that if given the exact same task that you needed to do in the monolith or even monorepo, versus that exact same task having to be done in a multi-repo service, they were able to do it about two times more quickly. That was pretty much how we leaned into it. A lot of that was eliminating what we call developer waste. We also outlined generally, what it would look like working on a feature in a sprint, and how much faster could you accomplish that with removing a lot of that developer waste.
Participant 3: How do you incorporate operations into this? When I say operations, I’m talking about infrastructure, infrastructure of services, architecture. Do you incorporate any service templates or architecture templates into this developer platform that speeds up the teams?
Fodera: We have the advantage that our team is part of our platform and infrastructure team, so we sit very closely with a peer of mine who runs more of the cloud infrastructure, so I’m constantly collaborating with them. I think that collaboration helps us really go in lockstep. I think what was really most important is that, like our templates, we did ensure that we had all those integrations out of the box. As we were having those templates be created, we made sure that it worked well from an infrastructure perspective. Really staying in lockstep: he’s a peer of mine, and he works very closely with me. I think that also helped from that perspective, from an organizational perspective, where we were set up for success.
Participant 4: I noticed that you showed a lot of UI based tools. However, I also know that infrastructure as code is important, especially when it comes to deployments and configuration, in which cases did you use the UI or infrastructure as code, and how did you combine the two?
Fodera: The cloud maturity one is an initiative that we’re actively working on, and that is a question that actually comes up. There’s actually a great talk by another company that talks about how you want to go with that 80/20 rule. What we found, and this was actually still true with our developers as we’re working with them. One, talk with your customers, who are your developers, in this case, and see what they want. It’s not going to be a one-size-fits-all for all companies.
What we found at CarGurus was that about 80% of people just wanted few button clicks to go and introduce a new service or get some database or whatever, and they really didn’t want to have to worry about learning Terraform, which is what we’re using under the covers for infrastructure provisioning. Lean into that 80/20 rule: 80%, whatever your 80% of your customers want, cater to that. If they want the UI, lean into that. If not, go with that approach of providing them the ability to self-service. If you have a company that everybody knows Terraform, probably not worth abstracting it away with the UI.
Participant 5: A part of your journey was migration from a monolith to microservices. Can you tell us a little bit more about your journey and what went well, and lessons learned? What recommendation could you give to other people who are going through this journey right now.
Fodera: Actually, the vertical slice model that I showed, that actually didn’t work well. I actually have a blog post that talks about how we failed a few times in our microservice journey, on cargurus.dev. It actually talks about our journey specifically for the monolith to microservices. The vertical slice approach didn’t work. That was trying to actually go and make it so we could vertically slice, detangle that big ball of mud. That actually proved to be very inefficient. That’s where we started with more of the strangler fig pattern, where we made it very easy to introduce new services instead of trying to detangle the existing.
Then we tried to enforce a culture where we said, as you’re introducing new features, do you actually need to introduce it into the monolith, or could you introduce it as a new service? Then we started with backend services only. That worked really well. Then we used the embankment approach that I talked about to help with the frontend services, and that helped a little bit. Then our shift to multi-repo, where we invested into a Remix template, was really that solidifying factor to help us decouple from a frontend perspective.
See more presentations with transcripts