Article originally posted on InfoQ.
Key Takeaways
- SRE adoption is greatly influenced by the organizational culture at hand. Therefore, assessing the organizational culture is an important step at the beginning of an SRE transformation.
- The Westrum model of organizational cultures can be used to assess an organization’s culture from the production operations point of view. The six aspects of the model – cooperation level, messenger training, risk sharing, bridging, failure handling and novelty implementation – relate directly to SRE.
- Westrum’s performance-oriented generative cultural substrate turned out to be a fertile ground for driving SRE adoption and achieving high performance in SRE.
- Subtle culture changes in the teams during SRE adoption accumulate to a bigger organizational culture change where production operations is viewed as a collective responsibility because different roles in different teams are aligned on operational concerns.
- Both formal and informal leadership need to work together to achieve the SRE culture change providing consistency, steadiness and stability amidst the very dynamic nature of the change at hand.
Introduction
The teamplay digital health platform and applications at Siemens Healthineers is a large distributed organization consisting of 25 teams owning many different digital services in the healthcare domain.
The organization underwent an SRE transformation, a profound sociotechnical change that reshaped the technology, process and culture of production operations. In this article, we focus on:
- How the organizational culture was assessed in terms of production operations at the beginning of the SRE transformation
- How a roadmap of small culture changes accumulating over time was created, and
- How the leadership facilitated the necessary culture changes
The need to assess the organizational culture
When it comes to introducing SRE, it is easy to jump into the tech part of the change and start working on implementing new tools, infrastructure and dashboards.
Undoubtedly necessary, these artifacts alone are not sufficient to sway an organization’s approach to production operations. An SRE transformation is profoundly a sociotechnical change.
The “socio” part of the change needs to play an equal role from the beginning of the SRE transformation.
In this context, it is useful to assess the organization’s current culture, viewing it through the lens of production operations. This has the following benefits:
- a) It enables the SRE coaches driving the transformation to understand current attitudes towards production operations in the organization
- b) It reveals subtle, sometimes hardly visible, ways the organization operates in terms of information sharing, decision-making, collaboration, learning and others that might speed up or impede the SRE transformation
- c) It sparks ideas about how the organization might be evolved towards SRE and enables first projections of how fast the evolution might go
Given these benefits, how can the organizational culture be assessed from the production operations point of view? This is the subject of the next section.
How to assess the organizational culture?
A popular typology of organizational cultures is the so-called Westrum model by Ron Westrum. The model classifies cultures as pathological, bureaucratic or generative depending on how organizations process information:
- Pathological cultures are power-oriented
- Bureaucratic cultures are rule-oriented, and
- Generative cultures are performance-oriented
Based on the Westrum model, Google’s DevOps Research and Assessment (DORA) program found out through rigorous studies that generative cultures lead to high performance in software delivery. According to the Westrum model, the six aspects of the generative high performance culture are:
- High cooperation
- Messengers are trained
- Risks are shared
- Bridging is encouraged
- Failure leads to inquiry
- Novelty is implemented
These six aspects can be used to assess an organization’s operations culture. To approach this, the six aspects need to be mapped to SRE in order to understand the target state of culture. The table below, based on my book “Establishing SRE Foundations”, provides this mapping.
| # | Westrum’s generative culture | Relationship to SRE |
| --- | --- | --- |
| 1. | High cooperation | SRE aligns the organization on operational concerns. This is only possible if a high cooperation is established between the product operations, product development and product management. Executives cooperate with the software delivery organization by supporting SRE as the primary operations methodology. This is necessary to achieve standardization leading to economies of scale justifying investment in SRE. |
| 2. | Messengers are trained | SRE quantifies reliability using SLOs. Once corresponding error budgets are exhausted, the teams owning the services are trained on how to improve reliability. Moreover, the people on-call are trained to be effective at being on-call, which includes acting quickly to reduce the error budget depletion during outages. Postmortems after outages are viewed as a learning opportunity for the organization. |
| 3. | Risks are shared | Product operations, product development and product management agree on SLIs that represent service reliability well from the user point of view, on SLOs that represent a threshold of good reliability UX and on the on-call setup required to run the services within the defined SLOs. This leads to shared-decision making on when to invest in reliability vs. new features to maximize delivered value. Thus, the risks of the investments are shared. |
| 4. | Bridging is encouraged | SLO and SLA definitions are public in the organization, so is the SLO and SLA adherence data per service over time. This leads to data-driven reliability conversations among teams about reliability of dependent services. An SRE community of practice (CoP) is cross-pollinating SRE best practices among the teams and organizing organization-wide lunch & learn sessions on reliability. |
| 5. | Failure leads to inquiry | Postmortems after outages are used for blameless inquiry into what happened with a view to generate useful learnings to be spread throughout the organization. |
| 6. | Novelty is implemented | New insights from ongoing product operations, outages and postmortems lead to a timely implementation of new reliability features prioritized against all other work according to error budget depletion rates. |
With the target culture state defined in the table above, the SRE coaches can analyze how far away from it their organization currently is.
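One possible way to make this gap analysis concrete is a simple scoring exercise. The sketch below is illustrative only: the 1 to 5 scale, the `TARGET_SCORE` constant, the `culture_gaps` function and the example scores are all assumptions, not part of the Westrum model or of the assessment actually used in the transformation described here.

```python
# The six aspects of Westrum's generative culture, as listed in the article.
WESTRUM_ASPECTS = (
    "High cooperation",
    "Messengers are trained",
    "Risks are shared",
    "Bridging is encouraged",
    "Failure leads to inquiry",
    "Novelty is implemented",
)

# Assumption: aspects are rated on a 1-5 scale, where 5 means the
# organization fully matches the SRE target state for that aspect.
TARGET_SCORE = 5


def culture_gaps(scores: dict[str, int]) -> list[tuple[str, int]]:
    """Return (aspect, gap to target) pairs, largest gap first."""
    missing = [aspect for aspect in WESTRUM_ASPECTS if aspect not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    gaps = [(aspect, TARGET_SCORE - scores[aspect]) for aspect in WESTRUM_ASPECTS]
    # sorted() is stable, so aspects with equal gaps keep the model's order.
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


# Hypothetical ratings collected by the SRE coaches, for illustration only.
example_scores = {
    "High cooperation": 3,
    "Messengers are trained": 2,
    "Risks are shared": 2,
    "Bridging is encouraged": 4,
    "Failure leads to inquiry": 1,
    "Novelty is implemented": 3,
}

for aspect, gap in culture_gaps(example_scores):
    print(f"{aspect}: gap {gap}")
```

A ranking like this is only a conversation starter. The aspects with the largest gaps suggest where the first small culture changes might be aimed.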
Accumulating small culture changes over time
Once the SRE coaches understand the status quo, the SRE transformation activities can begin. These will include technical, process and behavior changes. To fuel the movement, the SRE coaches need to look for small behavior changes, celebrate them and stagger them in such a way that they accumulate over time.
For example, the following order of small changes can incrementally lead to bigger behavior changes over time, pushing the culture more and more toward the target state outlined in the previous section.
| # | Change | Culture impact | Culture impact accumulation over time |
| --- | --- | --- | --- |
| 1 | Putting SRE on the list of bigger initiatives the organization works on | Awareness of SRE and its promise at all levels of the organization | Acceptance of potential usefulness of SRE, open-mindedness to SRE |
| 2 | Establishing SRE coaches | Perception of SRE as a serious bigger initiative being driven by dedicated responsibles throughout the organization | SRE go-to people are known in the organization |
| 3 | Setting initial SLOs | The first reliability quantification is undertaken; new thinking of reliability as something being quantified is induced | SRE has its concepts. The central concept of SLO is now something we define for our services |
| 4 | Reacting to alerts on SLO breaches | Developers no longer do only coding but also spend time monitoring their services in production | Breaching the defined SLOs leads to alerts that developers spend time analyzing. Thus, the SLOs need to be very carefully designed to reflect the customer experience. Lots of SLO breaches lead to lots of time being spent on their analysis! |
| 5 | Setting up alert escalation policies | An SLO breach alert is so significant that it must reach someone who can react to it | Reaction to an SLO breach needs to happen in a timely manner, otherwise an escalation policy kicks in! |
| 6 | Implementing incident classification | Incidents need classification to drive appropriate mobilization of people in the organization | Mobilizing people to troubleshoot an incident happens depending on the incident classification |
| 7 | Implementing incident postmortems | Incidents warrant spending time on understanding what really happened, why and how to avoid the same incident from happening again in future | Incidents do not just come and go. Rather, they are carefully analyzed after being solved, inducing a learning cycle into the organization |
| 8 | Setting up error budget policies | Error budget consumption is tracked. Once it hits a certain threshold, it becomes subject to a predefined policy of action | Lots of SLO breaches can accumulate to significant error budget consumption. There is a policy to ensure the error budget consumption does not exceed some thresholds |
| 9 | Setting up error budget-based decision-making | Prioritization decisions about reliability are based on data from production tracking the error budget consumption over time | Different people at different levels of the organization use the error budget consumption data to steer reliability investments |
| 10 | Implementing organizational structure for SRE | SRE is so widely established in the organization that a formal structure with roles, responsibilities and organizational units is established | SRE is a standard operations methodology now that is even reflected in organizational structure and processes |
The culture changes outlined in the table above are driven using an interplay of formal and informal leadership. These dynamics are described in the next section.
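To illustrate what changes 8 and 9 in the table involve, here is a minimal sketch of error budget tracking for a request-based availability SLO. The function names, the 75% warning threshold and the wording of the policy actions are hypothetical. A real error budget policy would be agreed between product operations, product development and product management.

```python
def error_budget_consumed(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget consumed in a window (1.0 = fully spent).

    slo_target is the fraction of events that must be good, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no traffic in the window, so no budget was spent
    # The error budget is the number of bad events the SLO tolerates.
    allowed_bad_events = (1.0 - slo_target) * total_events
    return bad_events / allowed_bad_events


def policy_action(consumed: float) -> str:
    """Map budget consumption to an action. Thresholds are assumptions."""
    if consumed >= 1.0:
        return "prioritize reliability work over new features"
    if consumed >= 0.75:
        return "review the reliability backlog with the product owner"
    return "continue feature work"


# Hypothetical window: 1,000,000 requests, 800 of them failed, 99.9% SLO.
consumed = error_budget_consumed(slo_target=0.999, total_events=1_000_000, bad_events=800)
print(f"error budget consumed: {consumed:.0%} -> {policy_action(consumed)}")
```

The point of the sketch is the decision input, not the arithmetic. Once consumption is a number that everyone can see, the prioritization of reliability vs. features can be discussed on the basis of production data.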
Interplay of formal and informal leadership
In every hierarchical organization, there are leaders who possess formal authority due to their placement in the organizational chart. If these leaders are trusted by the broader organization, they enjoy a multiplication effect on their efforts thanks to a large following of people in the organization.
At the same time, lots of hierarchical organizations have informal leaders who do not possess formal authority because they do not have a prominent place in the organizational chart. They have, however, earned trust from the overall organization. This trust enables them to also enjoy a multiplication effect on their efforts because a large number of people in the organization follow them voluntarily.
In the table below, formal and informal leadership types are summarized.
| | Supporting the SRE transformation | Detrimental to SRE transformation | Supporting the SRE transformation |
| --- | --- | --- | --- |
| Leadership type | Formal leadership enjoying trust from the organization | Formal leadership without trust from the organization | Informal leadership enjoying trust from the organization |
| Following type | A large following of people in the organization, which is both voluntary and authority-based | A following of people based on formal authority | A large voluntary following of people in the organization |
A good combination of the two supporting leadership types in the table above, trusted formal leadership and trusted informal leadership, provides the environment needed to push SRE through the organization both top-down and bottom-up in appropriate proportion. It provides the consistency, steadiness and stability required amid the very dynamic nature of the SRE transformation. The teams feel that the formal leadership supports SRE while informal leaders help drive the necessary mindset, technical and process changes throughout the organization. This maximizes the chances of success for the SRE transformation.
From the trenches
The culture assessment method described above helped the Siemens Healthineers digital health platform organization successfully evolve operations towards SRE. In this section, we present a few real learnings from the trenches of our SRE transformation.
Learning 1: Involve the product owners from the beginning
One of the most profound things we got right was to involve the product owners in the SRE transformation from the beginning. The SRE value promise for the product owners is to reduce customer escalations they might experience due to the digital services not working as expected. The escalations are annoying, time-consuming and draw unwanted management attention. This motivates the product owners to attend SRE meetings where the SLOs are defined and associated processes are discussed.
The product owners in SRE meetings:
- Provided context of the most important customer journeys from the business point of view
- Assessed the business value of higher reliability at the cost discussed in the meetings
- Got closer to production operations by being involved in SRE discussions from the start
- Developed an understanding of how to prioritize investments in reliability vs. features in a data-driven way
Learning 2: Get the developers’ attention onto production first
The major problem with organizations new to software as a service is that developers are not used to paying attention to production. Rather, traditionally their world starts with a feature description and ends with a feature implementation. Running the feature in production is out of scope. This was the case with our organization at the beginning of the SRE transformation.
In this context, the most impactful milestone to achieve at the beginning of the SRE transformation was to channel the developers’ attention onto production. This was an 80/20 kind of milestone, where 20% of the effort yields 80% improvement.
It was less important to get the developers to be perfect about their SLO definitions, error budget policy specifications, etc. Rather, it was about supplying the developers with the very basic tools and the initial motivation to move their attention to production. Regularly spending time in some production analyses was half the battle when acquiring the new habit of operating software.
Once there, the accuracy of applying the SRE methodology could be brought about step by step.
Learning 3: Do not fear letting the team fail fast at first
When it came to the initial SLO definitions, our experience was that teams tended to overestimate the reliability of their services at first. They tended to set higher availability SLOs than the services achieved on average. Likewise, they tended to set stricter latency SLOs than the services could fulfill.
Convincing the teams at this initial stage to relax the initial SLOs was futile. Even the historical data sometimes did not convince the teams. We found that a fail fast approach was actually working best.
We set the SLOs as suggested by the teams, without much debate. Unsurprisingly, the teams got flooded with alerts on SLO breaches. Inevitably, the big topic of the next SRE meeting was the sheer number of alerts the team could not process.
This made the teams fully understand the consequences of their SLO decisions. In turn, the SLO redefinition process got started. And this was exactly what was needed: a powerful feedback loop from production on whether the services fulfill the SLOs or not, leading to a reevaluation of the SLOs.
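The feedback loop can be illustrated with a small sketch that compares a proposed availability SLO against measured history. Everything here is an assumption for illustration: the daily measurements are invented, and `breach_days` and `achievable_slo` are hypothetical helpers, not tooling from our transformation.

```python
def breach_days(daily_availability: list[float], slo_target: float) -> int:
    """Count days on which measured availability fell below the SLO target."""
    return sum(1 for value in daily_availability if value < slo_target)


def achievable_slo(daily_availability: list[float], max_breach_days: int) -> float:
    """Highest target that the history would have breached on at most
    max_breach_days days."""
    if not daily_availability:
        raise ValueError("need at least one measurement")
    ordered = sorted(daily_availability)
    # Only values strictly below ordered[k] count as breaches of that target,
    # and there are at most k of them.
    index = min(max_breach_days, len(ordered) - 1)
    return ordered[index]


# Invented daily availability measurements for one week.
history = [0.9995, 0.9981, 0.9999, 0.9972, 0.9990, 0.9998, 0.9969]

proposed = 0.9995  # the team's optimistic initial SLO
print("breach days with proposed SLO:", breach_days(history, proposed))

relaxed = achievable_slo(history, max_breach_days=1)
print("relaxed SLO:", relaxed, "breach days:", breach_days(history, relaxed))
```

In our experience, showing such numbers up front rarely convinced the teams. The same comparison made after a week of real alerts, however, gives the SLO redefinition discussion a concrete starting point.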
Learning 4: Build a coalition of formal and informal leaders
We found it very useful to have a coalition of formal and informal leaders championing SRE in the organization. The informal leaders were self-taught about SRE and bursting with energy to introduce it in the organization. To do so, they required support from the formal leadership to commit capacity in the teams for SRE work.
The informal leaders needed to sell SRE to the formal leaders on the promise of reducing customer escalations due to service outages. These conversations happened with the head of R&D and head of operations. In turn, these leaders needed to sell SRE to the entire leadership team so that the topic would be put onto the portfolio list of big initiatives undertaken by the organization.
Once that happened, there was a powerful combination of enough formal leaders supporting SRE, SRE being on the list of big initiatives undertaken by the organization and an energized group of informal leaders ready to drive SRE throughout the organization.
This organizational state was conducive to achieving successful production operations using SRE!
Summary
An SRE transformation is a large sociotechnical change for a software delivery organization that is new to or just getting started with digital services operations. The speed of the change is largely determined by the organizational culture at hand. It is people’s attitudes to and views about production operations that are the highest mountains to move, not the tools and dashboards used by the people on a daily basis.
Therefore, assessing the organizational culture before embarking on the SRE transformation is a useful exercise. It enables the SRE coaches driving the transformation to understand where the organization currently is in terms of operations culture. It further ignites a valuable thinking process of how it might be possible to evolve the culture towards SRE.