Article: How Skyscanner Embedded a Team Metrics Culture for Continuous Improvement

By Ramsay Ashby and Laz Allen

Article originally posted on InfoQ.

Key Takeaways

  • Data has been used to drive organisational improvement for decades. More recently, new metrics, such as DORA, have started to emerge to improve the pace of software delivery, but adoption is not straightforward and can easily result in fear and rejection from team members and their leaders.
  • Focusing on culture change rather than tool-use produces better and longer lasting results, and requires a fundamentally different approach.
  • In culture-change initiatives, being ready to change the approach at short notice is vital, so being prepared from the start is key.
  • Thinking about who the primary users of the tool would be and how they would want to use it clearly favoured some vendors over others. We selected one that would enable our team and reinforce the cultural change that was the fundamental goal of the exercise.
  • Focusing adoption on a small subset of metrics enabled us to overcome the barriers to adoption and make the most impactful improvements.

Team metrics are in vogue: there is a plethora of companies offering tools to support our agile practices, and there is no shortage of books espousing the value of various combinations of metrics that will lead your team(s) to delivery stardom. In this article we will share our journey, starting from the point of realising this could enhance our competitive advantage and the signs that indicated change was needed. We explain why we chose not just to deploy a tool, but to think about this as changing our engineering management culture to one that values and utilises these metrics to drive greater improvement at scale.

We will explain how our understanding of what we needed evolved throughout the exploration process and how we ran trials with teams to understand this better. We will share our interpretation of the most important metrics, the sources we were inspired by, and how we expected to need to augment them. Then we’ll explore the challenges we discovered with adoption, those we predicted, and how we altered our plans to adapt to them. The piece will then look at the criteria we used to assess the various tool providers to find one that we felt would enable our teams and, more importantly, reinforce the cultural change that was the fundamental goal of this exercise.

At the time of writing, Skyscanner is guiding 85 squads (~700 engineers) through this change, and we have been seeing good levels of adoption and insights being acted on, having created cross-organisation communities of practice to support this change and bed it in for the long term. We will talk about the principles we have used to drive this adoption.

Exploring Metrics

Skyscanner is an organisation that continually strives to improve and knows that we need to keep adapting as opportunities change around us. To that end, we need to continually strengthen our approach to ensure continuous improvement at all levels. We already use data to inform our strategic and day-to-day product development decisions, so it made sense for us to achieve the former by thinking more deeply about the latter. The space here has been maturing for some time with works like Accelerate, Agile Management and Project to Product, so we decided to explore how we could apply these concepts and practices across the organisation.

Travelling back in time a decade or so, Skyscanner released an update to its monolithic codebase on a six-weekly release train. This train made it very easy to understand how much ‘value’ we were able to release in a given period and also to get a good read on the overall health of our code and engineering org because it was easy to compare the contents of one release to the next, and what was expected to ship. 

But in a world where teams are potentially deploying changes multiple times each day, we don’t have that same implicit view of how long new features take us to build – because each deployment usually contains just a tiny part of a more significant piece of work. We recognised that our CI & CD practices had matured to a point where we needed to explicitly seek out some information that would once have been fairly obvious. 

We knew that helping our teams have visibility of the right metrics would enable them to identify and act on the most impactful improvements to how they worked together. We also recognised that we could aggregate these metrics in some cases to identify organisation-wide improvements. 

Selecting Metrics

We started from two places. The DORA metrics – deployment frequency, lead time for changes, change failure rate, and time to restore service – were already well understood as a concept in our engineering organisation, so there was already an appetite to explore using them more sustainably and consistently across our 85 squads. These metrics are well suited to improving DevOps practices. That said, we knew we needed more than that, as DevOps practices – integration and deployment – only account for a small part of the overall value stream. They are also the most automatable parts of the process and are often optimised in isolation from the rest of the value stream.

So we began looking at the Flow Metrics that Mik Kersten describes in his book, Project to Product. These metrics resonated strongly with us, and the journey that the book described also felt fitting. They are termed flow time, flow load, flow distribution, flow efficiency, and flow velocity.

But when we started talking about these metrics with our trial group, we found that the novel vocabulary increased confusion and hindered adoption. Within the software industry, terms like “cycle time” and “work in progress” are already well understood, so we opted to stick with those, making for a gentler learning path than the vocabulary Kersten uses in the book.

The core set of metrics that we went with were:

  • Cycle time for epics, and stories – how long it takes us end-to-end, including all planning, development, testing, deployment, etc. to get a piece of work live in production
  • Work in progress – how many concurrent work items are in any “in progress” state (i.e. not “to do” or “done”) on a team’s board at any given time
  • Waiting time – the amount of time that a piece of work spends in an idle state, once work has begun. The time spent waiting for a code review is one of the most significant contributors to this metric
  • Distribution of work (across different categories) – tagging each work item as one of “defect”, “debt”, “feature”, or “risk” and tracking the overall distribution of these categories (expecting a healthy mix of each)

Our initial trial involved two squads capturing these metrics manually in Excel and discussing them regularly in their retrospectives. This allowed us to prove very quickly that we could drive meaningful improvements to team delivery. We saw a >37% reduction in epic cycle time just from limiting work in progress and making small tweaks to the team’s ways of working, such as better prioritising code reviews – far beyond our expectations.
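To make those definitions concrete, here is a minimal Python sketch of how a team might compute these metrics from exported work-item data. The WorkItem fields and helper functions are hypothetical – they do not reflect Skyscanner’s spreadsheets or Plandek’s implementation – and assume each item records when work started, when it finished, how long it sat idle, and its category.

from dataclasses import dataclass
from datetime import datetime, timedelta
from collections import Counter

@dataclass
class WorkItem:
    key: str
    category: str                    # one of "defect", "debt", "feature", "risk"
    started: datetime                # first moved out of "to do"
    done: datetime | None            # moved to "done"; None if still in progress
    idle: timedelta = timedelta(0)   # total time spent waiting, e.g. awaiting code review

def cycle_times(items):
    """Cycle time per finished item: from starting work to live in production."""
    return {i.key: i.done - i.started for i in items if i.done is not None}

def work_in_progress(items, at):
    """Number of items in an 'in progress' state at a given moment."""
    return sum(1 for i in items if i.started <= at and (i.done is None or i.done > at))

def waiting_share(items):
    """Fraction of total cycle time spent idle, across finished items."""
    finished = [i for i in items if i.done is not None]
    total = sum((i.done - i.started for i in finished), timedelta(0))
    idle = sum((i.idle for i in finished), timedelta(0))
    return idle / total if total else 0.0

def distribution_of_work(items):
    """Share of work items per category (defect / debt / feature / risk)."""
    counts = Counter(i.category for i in items)
    return {cat: n / len(items) for cat, n in counts.items()}

Run over a board export ahead of each retrospective, something like this roughly mirrors what the trial squads did by hand in Excel.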

Evaluating Tools

We started from the position that we wanted to have all our value stream and productivity measures in the same place. At Skyscanner we already have a lot of tools, so introducing only one new tool rather than several was very important; it would also make rollout and communication much easier.

From there we started reviewing the available tools and speaking to the teams building them; a few key deciding factors became clear to us:

By looking at the underlying ethos of the tools, we identified that some were aimed at managers, heads of departments and CTOs, while others were aimed at being most useful to the teams themselves. These two approaches produced quite different user experiences; the products looked different and you could tell which audience each was mainly aimed at. We felt fairly safe in predicting that a tool that did not put use by teams first would not be used by our teams, and that a tool not used by our teams would not produce meaningful data for anyone else. So that use case became our primary focus.

We also looked at how “opinionated” the tool was about the team’s ways of working. Some of the tools required teams to work in defined ways in order for them to produce meaningful data, whilst others were much more adaptable to different ways of working. At Skyscanner we encourage teams to adapt and refine their ways of working to best fit their needs, so we didn’t want to introduce a tool that forced a certain convention, even if the data would ultimately lead teams to make some changes to how they worked. 

After going through this process and trialling a few of the tools, we settled on Plandek to gather the metrics and present the data and insights to our teams. The team at Plandek has been great to work with throughout, understanding our needs, giving us relevant advice, and in some cases making additions to the product to help us get to where we want to be.

Changing Culture

This was probably the part that we put the most effort into, because we recognised that any missteps could be misinterpreted as us peering over folks’ shoulders or, even worse, as using metrics intended to signal improvement opportunities to measure individual performance. Either of those would be strongly against the way we work at Skyscanner and would have stopped the project in its tracks, maybe even causing irreversible damage to its reputation. To that end, we created a plan that focused on developing a deep understanding of the intent with our engineering managers before introducing the tool.

This plan focused on a bottom-up rollout approach, based on small cohorts of squad leads. Each cohort was designed to be around 6 or 7 leads, with a mix of people from different tribes, different offices, and different levels of experience, covering all our squads. The smaller groups would increase accountability, because it’s harder to disengage in a small group, and also create a safe place where people can share their ideas, learnings, and concerns. We also endeavoured to make sure that each cohort had at least one person who we expected to be a strong advocate for team metrics, and no more than one person that we expected to be a late adopter or detractor. This meant that every cohort would generally have a strong positive outlook, with minimal risk of collectively building a negative sentiment.
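Purely as an illustration of that composition rule, the sketch below shows one way cohorts could be assembled so that each has roughly 6-7 leads, at least one expected advocate and at most one expected detractor. The Lead type, stance labels and build_cohorts helper are hypothetical – this is not how we actually assigned people.

import random
from dataclasses import dataclass

@dataclass
class Lead:
    name: str
    stance: str  # "advocate", "neutral" or "detractor" (our expectation, never a shared label)

def build_cohorts(leads, size=7):
    """Greedy allocation: seed each cohort with an advocate and spread detractors out."""
    advocates = [l for l in leads if l.stance == "advocate"]
    detractors = [l for l in leads if l.stance == "detractor"]
    neutrals = [l for l in leads if l.stance == "neutral"]
    random.shuffle(neutrals)  # stand-in for mixing tribes, offices and experience levels

    n_cohorts = -(-len(leads) // size)  # ceiling division
    if len(advocates) < n_cohorts or len(detractors) > n_cohorts:
        raise ValueError("constraints cannot be satisfied with this cohort size")

    cohorts = [[advocates.pop()] for _ in range(n_cohorts)]  # one advocate per cohort
    for i, d in enumerate(detractors):                       # at most one detractor per cohort
        cohorts[i].append(d)
    for i, lead in enumerate(advocates + neutrals):          # round-robin the rest
        cohorts[i % n_cohorts].append(lead)
    return cohorts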

Cohorts would meet every 2-4 weeks; in each session we’d take them through a little structured learning – perhaps explaining what a particular metric means and why it is useful, as well as some tips for improving it – followed by a healthy amount of discussion and knowledge sharing. Between sessions we’d encourage people to try things out, take data back to their teams to discuss in retrospectives, and then bring any insights back to the next cohort session.

Whilst with hindsight we’d make some minor tweaks to the exact composition of some cohorts where perhaps some advocates weren’t as strong as we’d expected, overall this approach has worked very well for us, and we intend to follow this pattern for similar rollouts in the future. As we progress through Q1 in 2023, we will re-balance the cohorts to reflect the different levels of understanding that people have and to enable us to onboard new managers. 

Outcomes

We’re very happy with the level of engagement that we’ve seen from most of our engineering managers. The usage data that Plandek shares with us shows a healthy level of regular engagement, with most managers logging in at least once a Sprint. Our most engaged team members are setting an incredibly high bar with potentially dozens of sessions each Sprint, showing just how engaging team data like this can be.  

With this, we’ve seen an increase in awareness of work-in-progress and cycle time across all levels of the organisation, and the outcomes we’d expect from it; teams are feeling less stressed, have more control over their workloads, and all those other great things. The benefits are yet to fully filter through to visible impacts on the wider delivery picture, but these things take time and are relatively well understood, so we’re happy being patient! 

One of our squad leads noted: “Having clear visibility of the cycle time also allowed the team to be more aware of the way that we split the tickets, the type of tickets that are of the type spike”. This is a great example of awareness of the data leading naturally to the right kind of (agile) change without having to educate people about it.

The initiative has also helped to raise the profile of these metrics, such that people naturally talk about them in different contexts without any prompting. It’s great to see epic cycle time being a consideration when teams are weighing up options in their system design reviews. Similarly, our executive and senior leadership teams have started to adopt the same vocabulary, which is both a great sign and further accelerates adoption. Before, these were only concepts; now that they are tangible measures, people understand them much better, and action follows understanding.

Adoption Challenges

Naturally, it hasn’t all been perfect or easy. The two greatest challenges were competing for people’s time and variable levels of engagement.

First, everyone sees themselves as “busy”; some, as the saying goes, are “too busy to improve”. That meant it was a priority to position this as something relatively lightweight that could quickly help them make more space for thinking. We did this by cutting the dashboard of metrics we asked people to look at down to what we thought was the absolute minimum that was useful. Plandek has over 30 different metrics, but we’ll get to more of them in time. As people explored the initial subset, they naturally began experimenting with the other insights on offer.

Second, as we moved forward with the cohorts, not everyone in each cohort was moving at the same speed, which sometimes led to disjointed conversations.  As mentioned above, we’re going to rebalance the cohorts as we progress through Q1 2023 to realign people with the level of support they will benefit most from. We still think starting off with the highly mixed cohorts was the right thing to do to align with the message that this was not about oversight; now that this has been somewhat proven we can think about how we move forward.

Learnings

Our biggest learning was undoubtedly just how critical the underlying culture change was. From the outset, we were conscious and explicit that we didn’t want to “roll out a tool”, but rather to roll out the “thinking process” around team metrics. But it was only after our first proof of concept that we realised we needed to dig even deeper, focus on the foundations, and build the notion of delivery metrics into our culture, as we have with service and product metrics.

We did this mainly by being clear about what we called the initiative (that is, not a “Plandek rollout”) and then front-loading the cohort sessions with theory and explorative discussion before we got to using the tool. The small, frequent cohort groups allowed people to become comfortable with the concepts. Had we approached this as a tool rollout rather than a culture change, we could easily have just run a few large one-off training sessions to “get people started on the tool”.

Beyond that, it all takes more time and effort than we had assumed. There is value in making things simple: it’s not about having many metrics and many advanced ways of using them; it’s about getting smart people to understand the why and the value, giving them easily accessible tools, arming them with enough understanding to get started, and then supporting them and getting them to share their successes.
