Article: Hard-Won Lessons from the Trenches: Failure Modes of Platform Engineering — And How to Avoid Them

Aaron Erickson

Article originally posted on InfoQ. Visit InfoQ

Key Takeaways

  • Forcing people to use your platform can lead to “malicious compliance.” Every time the platform fails them, however, they’ll blame it — and your platform team.
  • If your IDP forces people to accept unworkable practices, it won’t succeed.
  • Don’t sacrifice security to implement your platform faster. When things go wrong, your biggest fans will become your worst detractors.
  • A strong focus on the costs and budget of your platform will help ensure the business continues investing the next time budgets are under review.
  • Start small and iterate. You probably don’t have the budget and resources to reinvent Heroku.

It’s impossible to dispute the value platform engineering brings businesses as a whole. Not only did Gartner name platform engineering as one of its top 10 tech trends for 2023, but it also included it in its well-known hype cycle, making a strong case for how companies can improve developer experiences by building internal developer platforms (IDPs). 

Even more relevant, the platform engineering community is rapidly growing. In other words, this is not just about decisions made in some boardroom. Developers who use internal developer platforms have gained an indisputable appreciation for their advantages on the front line, like increased developer productivity, better product outcomes, better developer experiences, and lower costs. 

In today’s world, where everyone is doing something in the cloud and workforces are clamoring for more distributed opportunities, IDPs have gained a reputation as competitive necessities that dramatically improve developer quality of life. 

Unfortunately, this also means that expectations are sky-high — so anything short of inventing a new Heroku (but of course without any of its limitations) might be seen as a failure.

It’s critical to double down on anything and everything that increases developer productivity. But before you start, it’s all about knowing what not to do: Understanding where the pitfalls lie makes dodging them easier. So here are some of my takeaways from creating IDPs for Salesforce and other companies, seeing countless efforts crash and burn, and watching platform engineering evolve.

The Failure Modes That Plague Platform Engineering Efforts

There are numerous ways platform engineering projects can fail, including some that’ll hamstring them before they ever reach maturity. Watch out for these failure modes as you create your IDP:

The Build It and They Will Come Fallacy

This is a big logical error. Know what you’re doing by assuming people will use your platform just because you’ve built it? Say it with me: You’re setting yourself up for failure. 

Sure, your platform may be better than the already-broken systems you’re trying to improve on. That doesn’t mean it won’t have its own pain points, like wasting time or failing to serve developers’ needs.

The best-case outcome in such a scenario is that you’ll foster discontent. Users will just grumble about the new problems you’ve saddled them with instead of complaining about the old ones. This is not an improvement.

What’s more likely is that you’ll engender an environment of “malicious compliance.” People will use the platform because that’s what they’ve been told to do. Every time it fails them, however, they’ll blame it — and your platform team. Way to go, scapegoat!

This does far more than just harm your career. It also prevents future adoption, not to mention poisons the well of your corporate culture. 

Fortunately, there’s a straightforward solution. Be a good, dedicated product manager, and get closer to your customer — preferably well before you start building anything.

Projects that focus on what the platform team thinks the end product should include (instead of asking the users what they really need!) get irrecoverably bogged down. In other words, understanding your typical user personas in advance should be your guiding star.

Remember: Improving the developer experience is a crucial goal. Solicit regular feedback so that you understand what to build and how to improve. By adopting a user-centric product management approach, you’ll naturally promote usage by solving important problems.

The One True Way Falsehood

Building IDPs means building golden paths for developers. The problem is that some would-be golden paths don’t quite hold up to scrutiny. There’s no “one true way” to develop software, and if your IDP forces people to accept unworkable practices, it won’t succeed. 

At one point or another, devs who use overly opinionated tooling will come up against edge cases you didn’t anticipate — and being good devs, they’ll start looking more closely to devise workarounds. Then, it’s only a matter of time before they’ll realize: “Hey! These are just a bunch of bricks somebody’s painted yellow!”

Your golden paths need to accommodate people’s tendency to go offroad and be adaptive enough to match. For example, your IDP can still succeed without supporting queueing technologies like managed Kafka or distributed computation frameworks like Hadoop out of the box. But it should be architected so that integrations can be added without reworking the core, letting you extend it later. The architecture of the platform must already take into account that developers will want to self-serve future technologies you haven’t even heard about.
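One way to picture that kind of extensibility is a small registry that the platform core talks to, with each technology plugged in behind it. The sketch below is purely illustrative: `IntegrationRegistry`, the `Provisioner` signature, and the resource strings are hypothetical names invented for this example, not part of any real IDP product.

```python
from typing import Callable, Dict, List

# Hypothetical contract: a provisioner takes a config dict and
# returns an identifier for the resource it created.
Provisioner = Callable[[dict], str]


class IntegrationRegistry:
    """Maps a technology name to the code that provisions it."""

    def __init__(self) -> None:
        self._provisioners: Dict[str, Provisioner] = {}

    def register(self, kind: str, provisioner: Provisioner) -> None:
        # Refuse silent overrides so two teams cannot clobber each other.
        if kind in self._provisioners:
            raise ValueError(f"integration already registered: {kind}")
        self._provisioners[kind] = provisioner

    def provision(self, kind: str, config: dict) -> str:
        # Unsupported technologies fail loudly instead of half-working.
        if kind not in self._provisioners:
            raise LookupError(f"no integration registered for: {kind}")
        return self._provisioners[kind](config)

    def supported(self) -> List[str]:
        return sorted(self._provisioners)


# Day one: the platform ships with a single integration.
registry = IntegrationRegistry()
registry.register("postgres", lambda cfg: f"postgres://{cfg['name']}")

# Later: Kafka support is added without touching the registry itself.
registry.register("kafka", lambda cfg: f"kafka-topic/{cfg['name']}")
```

The point of the design is that adding Kafka, Hadoop, or something nobody has heard of yet means writing one new provisioner, while the self-service entry point that developers call stays the same.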

The People Pleaser Platform

While the “one true way” antipattern leads to certain failure, so does the opposite — building a “people pleaser” platform. After all, you’ll never make everyone happy, and trying will likely make things worse.

Not every feature request carries equal weight. Imagine one of your teams wanted to use some cutting-edge experimental technology. Integrating said tooling might make them happy, but it could also result in widespread instability — not quite a desirable platform characteristic. 

In other cases, you’ll simply oversell what you can do and inevitably end up disappointing. For instance, you may have lots of teams that use diverse tech stacks. By saying “sure, we can support them all, no problem,” you’ve condemned yourself to a kind of Sisyphean torture. 

Remember that you can only do so much. More importantly, understand that the thinner you spread your team, the more the quality of your work will suffer. Even if you’re working with massive amounts of funding and resources, you can’t support every conceivable combination of technologies, let alone do so well!

Instead of trying to please the entire organization, start with an MVP and work closely with a lighthouse team that has an early-adopter mindset and is willing to give continuous feedback. By helping you guide improvements and enhance production efficiency, your lighthouse project will let you satisfy needs as they arise. Of course, you’ll still need to prioritize ruthlessly as you go, but that’s a lot easier when you aren’t in the habit of overextending. Once you’ve gained stability, you can roll out the platform to more teams before you expose it to the entire organization.

The House of Cards Architecture

You might be creating a platform in the hopes of helping your organization reach new heights — but it shouldn’t come out looking like a Jenga tower of unstable technologies. There couldn’t be a worse strategy when you’re trying to create a foundation for the products and services that let your company thrive.

So how does this occur? After all, no platform team sets out to create a royal mess — but I’ve seen it happen countless times. 

The problem often lies in how teams architect their platforms. Many try to weld together a slew of immature technologies that are still in the early phases of their lifecycles. Even if you can keep up with a few of these fast-moving components, the combined effect compounds exponentially — leaving you in the dust. 

Avoid the house of cards antipattern by pacing yourself early. Don’t hesitate to tackle the tough, unsexy tasks instead of going straight for the flashy stuff. Raise your odds of success by starting with the essentials.

Imagine you were building a house: You’d do things in a specific order. First, you’d pour the foundation, then build the frame and walls, and lastly add the doors, windows, and trim. As I mentioned in a recent blog post, that means you start by designing your IDP’s architecture before you think about building a shiny UI or service catalog.

And sure, I’ll admit that this strategy might not be the most glamorous — but laying the groundwork in the proper order pays off later. On top of increasing your IDP’s stability, constructing a solid base can make other tasks more doable, like integrating those cool technologies that teams have been clamoring for since the beginning.

The Swiss Cheese Platform

Swiss cheese has a lot going for it. For instance, it’s wonderfully aerodynamic, lightweight, and pretty tasty. Like Jenga, however, it’s not a good look for an IDP.

The problem is that you don’t always get to pick where the holes are — and some are deadlier than others. While you might be able to overcome gaps in areas like usability, security is an entirely different ballgame. 

Remember: It only takes one vulnerability to cause a breach. If your platform is rife with holes, you dramatically increase the odds of compromising data, exposing sensitive customer information, and transforming your organization from an industry leader into a cautionary tale. 

Building a platform that everyone loves is a noble pursuit. If you’ve done so at the expense of security, however, then your biggest fans will become your worst detractors the instant things go belly up. 

The solution here is straightforward: Make security a priority from day one of the project — or better, before you even get to the first day of coding. 

The only valid role Swiss cheese has in systems engineering is when you’re talking about risk analysis and accident causation, which drives home the point: You need to keep security front of mind at all stages. 

The Fatal Cost Spiral

This is a big one. Many teams create platforms that lack inherent cost controls — like AWS provisioning quotas. This tendency to go “full steam ahead, budget be damned,” is no way to build.

The drawbacks of this antipattern should be obvious, but it’s easy to overlook their scope. Every platform engineer understands that it’s bad to exceed their project budgets. Few appreciate the broader impacts cost overruns can have on a company’s unit economics. Many companies that are compute-intensive spend more on the cloud than they do on offices and payroll combined. Bad unit economics for a tech company can literally make the difference between meeting earnings expectations or not.

Unfortunately, this mental disconnect seems to be quite deep-rooted. I’ve been in companies where you couldn’t order the dev team a pizza without a CFO’s approval — yet anyone on the payroll could call an API endpoint that would spin up hundreds of thousands of instances per day!

Getting everyone in your organization on board with cost-conscious DevOps may be a long shot. As a platform architect, however, you can lead the charge. 

Consult with your FinOps liaison to keep your DevOps undertakings tightly aligned with your company’s financial architecture. If you don’t control the bottom line, your platform is doomed — no matter how much investment you secure early on!
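To make the idea of an inherent cost control concrete, here is a minimal sketch of a per-team quota check that could sit in front of a provisioning endpoint. Everything in it is an assumption made for illustration: the `InstanceQuota` class, the daily limits, and the team names are hypothetical, and a real platform would more likely lean on its cloud provider’s own quota and budget tooling.

```python
from dataclasses import dataclass, field
from typing import Dict


class QuotaExceeded(Exception):
    """Raised when a request would push a team past its daily limit."""


@dataclass
class InstanceQuota:
    # Hypothetical limits: maximum instances each team may launch per day.
    daily_limits: Dict[str, int]
    _used: Dict[str, int] = field(default_factory=dict, init=False)

    def request(self, team: str, count: int) -> int:
        """Reserve `count` instances and return the team's remaining quota."""
        if count <= 0:
            raise ValueError("count must be positive")
        # Teams without an agreed limit get no capacity by default.
        limit = self.daily_limits.get(team, 0)
        used = self._used.get(team, 0)
        if used + count > limit:
            raise QuotaExceeded(
                f"{team}: {used} used + {count} requested exceeds limit {limit}"
            )
        self._used[team] = used + count
        return limit - self._used[team]

    def reset(self) -> None:
        """Clear usage, for example from a job that runs once a day."""
        self._used.clear()


quota = InstanceQuota(daily_limits={"payments": 100})
remaining = quota.request("payments", 60)  # 40 instances left for today
```

With a guard like this, a second request for 50 instances from the same team would raise `QuotaExceeded` rather than quietly running up the bill, and raising a limit becomes a deliberate decision you can make together with FinOps.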

Vital Takeaways for Smarter Platform Engineering

So what differentiates winning platform engineering efforts from failures? In review, you need to:

  • Approach your platform from a product manager’s perspective,
  • Sell your platform, but don’t oversell it,
  • Treat your platform like a product, and identify who your key customers and stakeholders are,
  • Accept that you won’t reinvent Heroku or AWS unless you’ve got hundreds of millions of dollars to spend, and
  • Understand and iterate on an MVP that sets you up to win the next round of investment.

Of course, the failure modes listed thus far are just some of the ways to go wrong. Other common traps include mistakes like engineering IDPs to satisfy the loudest voice on the team at the expense of the quietest and sacrificing critical access to underlying technologies solely for the sake of abstraction. No matter which of these hazards proves the most relevant to your situation, however, gaining a broader appreciation for the risks in the early planning stages is a step in the right direction.

Know what to avoid — but also gain an appreciation of what to concentrate on. Focus on the things that make your platform engineering team more productive. Doing so will make it vastly easier to succeed and build something that delivers lasting organizational value. 
