MMS • Anil Kumar Ravindra Mallapur
Article originally posted on InfoQ.
Key Takeaways
- Incident management frameworks help organizations manage chaos during production outages and resolve incidents faster.
- The incident management lifecycle can be divided into 7 stages: Detect, Create, Classify, Troubleshoot, Resolve, Review, and Follow Up.
- Incident management processes require a dedicated set of folks performing different roles and responsibilities including incident managers, communications leads, on-call engineers, customer escalation managers, and executives.
- Dedicated incident management tools (Alert Management, On-call Management, Project Management, Incident Tracking, and Collaboration) enable efficient resolution of incidents and reduce toil.
- It is essential to track and measure key incident metrics, such as TTD, TTR, TTM, and Availability, to continuously evaluate the progress in the organization.
In this article, we provide an opinionated, generic framework for effective incident management, inspired by LinkedIn’s internal process, that can be tailored to fit the needs of different organizations. Standardized ITIL processes for incident management exist, but the framework described here differs from them and is tailored to resolving live production outages.
Most companies offer services online, and any outage means a poor end-user experience. Repeated outages can hurt the business and brand value. Frequent production outages are to be expected in complex distributed systems that change at high velocity. Organizations should embrace the reality of incidents and create an incident management process to facilitate faster resolution times.
What are Incidents?
Incidents are unplanned production outages that significantly disrupt the end-user experience and require immediate organized intervention.
Incidents can be internal or external based on the impacted users.
- Internal Incidents – Outages that impact employee productivity due to issues within tools that are used to get their job done can be termed internal incidents (e.g., deployment tooling is not functioning for an extended duration, employees cannot log into the VPN).
- External Incidents – Outages impacting the end-user experience of a company’s products/services are termed external incidents (e.g., users cannot purchase items from an e-commerce website, and users are not able to send messages in messaging software).
The above incidents can be further divided based on severity into Minor, Medium, and Major.
- Major – Severely impacts the end-user experience for many users, and there is a clear impact on business due to revenue loss or brand value.
- Medium – Impacts a significant part of the service but is usually localized to a specific region, unlike major incidents.
- Minor – Impacts a non-critical workflow of the service for a small percentage of users.
Consider a hypothetical example of incident severity on a social media website. The service being unavailable for most users for more than 30 minutes can be classified as a major incident. In contrast, the direct message feature not working for users in the Middle East might be classified as a medium incident, and the verified badge not showing up on profiles for users in Indonesia might be classified as a minor incident.
It is highly recommended to consider business goals and establish strict data-based guidelines on the incident classification to promote transparency and prevent wasting engineering bandwidth on non-critical incidents.
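One way to make such guidelines concrete is to encode them as thresholds over measurable impact. The sketch below is purely illustrative: the `Impact` fields and every threshold are assumptions, not values taken from any real process, and should be replaced with numbers derived from your own business goals.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    MINOR = "minor"
    MEDIUM = "medium"
    MAJOR = "major"


@dataclass
class Impact:
    # All fields are hypothetical inputs chosen for illustration.
    users_affected_pct: float  # share of active users affected, 0-100
    critical_workflow: bool    # True if a core user workflow is broken
    duration_minutes: int      # how long the impact has lasted so far


def classify(impact: Impact) -> Severity:
    """Map measured impact to a severity using example thresholds."""
    # Assumed rule: a core workflow broken for most users for 30+ minutes is major.
    if (impact.critical_workflow
            and impact.users_affected_pct >= 50
            and impact.duration_minutes >= 30):
        return Severity.MAJOR
    # Assumed rule: a core workflow broken for a sizeable subset of users is medium.
    if impact.critical_workflow and impact.users_affected_pct >= 5:
        return Severity.MEDIUM
    # Everything else is treated as minor.
    return Severity.MINOR
```

Keeping the rules in one small function makes the classification auditable and easy to revisit when business priorities change.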
What is Incident Management?
Incident management is the set of actions taken in a select order to mitigate and resolve critical incidents to restore service health as quickly as possible.
Incident Management Stages
Detect
Outages are proactively detected via monitoring/alerts set up on the infrastructure or by user reports via various customer support channels.
Create
Incidents are created for the detected outages triggering the initiation of the incident management process. Ideally, an organization can rely on a ticket management system similar to Atlassian’s JIRA to log incident details.
Classify
Incidents are then classified based on the established guidelines. It is highly recommended to draft these guidelines in alignment with business needs. Multiple terminologies are used across the industry today, but we will stick to the major, medium, and minor categorization to keep it simple. The incident management process and sense of urgency remain the same for all incidents, but classifying incidents helps prioritize when multiple incidents are ongoing simultaneously.
Troubleshoot
The person who initially reported the incident consults the internal on-call runbook and escalates the incident, to the best of their knowledge, to the on-call engineers of the relevant service. Escalations continue until the root cause of the issue is identified; sometimes, an incident involves multiple teams working together to find the problem.
Resolve
As their highest priority, the teams involved focus on identifying the steps to mitigate the ongoing incident in the shortest amount of time possible. The key is to take intelligent risks and be decisive in the following steps. Once the issue is mitigated, teams focus on resolving the root cause to prevent the recurrence of the problem. Throughout the resolution process, communication with internal and external stakeholders is essential.
Review
The incident review usually takes place after the root cause has been identified. The team involved during the incident and critical stakeholders get together to review the incident in detail. Their goal is to identify what went wrong and what could be improved to prevent similar issues or resolve them faster in the future, and to define short- and long-term action items to improve the process and the stack.
Follow Up
Incident action items are reviewed regularly at the management level to ensure all the action items related to the incidents are resolved. Critical metrics around incidents, such as TTD (Time To Detect), TTM (Time To Mitigate), TTR (Time To Resolution), and SLAs (Service Level Agreements), are evaluated to determine incident management effectiveness and identify the strategic investment areas to improve the reliability of the services.
Incident Management Roles & Responsibilities
A dedicated set of folks trained to perform specific roles during the incident is essential to successfully manage production incidents with minimum chaos. Ideally, people assume one function as the responsibilities are substantial and require particular skills. Roles can be merged and customized to fit the business needs and the severity of the incidents.
Incident Manager
The Incident Manager, referred to as IM for brevity in the document, is the person in charge of the incident, responsible for leading the incident to resolution with the proper sense of urgency. During an incident, a person should be responsible for the general organization of the incident management process, including communication and decisions. This person will be empowered to make decisions and ensure incidents are handled efficiently according to strategy.
Responsibilities:
The Incident Manager is responsible for four main aspects of incident management: organization, communication, decision management, and post-incident follow-up.
- The organization of an incident is paramount to efficient resolution.
  - The IM will be responsible for pulling in the correct teams and stakeholders to ensure a quick resolution.
  - The IM will work with stakeholders to ensure that work items raised during investigation and remediation are assigned and tracked.
- During an incident, many decisions need to be made.
  - The IM is responsible for identifying inflection points between investigation and quick resolution and ensuring that decisions are made promptly and appropriate stakeholders are engaged and aware.
  - The IM is empowered to judge who owns decisions when consensus cannot be reached during troubleshooting.
- After an incident, the IM is the communications point of contact for the incident.
  - As the IM was actively involved in the incident, they are responsible for leading the post-mortem in collaboration with service owners and stakeholders.
  - The IM will collaborate with service owners and present the incident overview and essential action items from the post-mortem to higher management.
On-Call Engineers
During an active incident, on-call engineers from impacted services and owning services are engaged to investigate and mitigate the issues responsible for the incident.
Responsibilities:
On-call engineers from affected services are responsible for evaluating the customer impact and service impact and validating the mitigation/resolution steps before giving the all-clear signal to close the incident.
Owning on-call engineers accountable for the service causing the outage/issues are responsible for actively investigating the root cause and taking remediation steps to mitigate/resolve the incident.
Communication Lead
Effective communication between stakeholders, customers, and management is critical in quickly resolving incidents. Dissemination of information to stakeholders, management, and even executives avoids the accidental compounding of incidents, helps manage chaos, prevents duplicate/siloed efforts across the organization, and improves time to resolution.
Responsibilities:
The Communications Lead is responsible for all written communications about the incident to internal and external stakeholders (employee and executive updates, social media updates, and status pages).
Customer Escalation Manager
In large companies that cater to a wide variety of enterprise customers with strict SLA requirements, it is common to have dedicated Customer Escalation Managers to bridge the communication between the customers and internal incident teams.
Responsibilities:
- Stay in contact with customers, collect details about ongoing incidents and relay the information to the internal team debugging the issue.
- Distill communication updates from the Communications Lead and regularly pass customized updates to customers.
- Identify workarounds that customers can apply until a full resolution of the issue is in place.
Executives
Executives responsible for the services causing the customer impact are constantly updated on the incident status and customer impact details. Executives also play a crucial role in making decisions about the incident that may impact the business, routing resources to speed up the incident resolution process.
Incident Management Tools
Many tools are required at each stage of the incident management lifecycle to mitigate issues faster. Large companies roll out custom-built tools that interoperate well with the rest of their ecosystem. In contrast, many open-source and commercial tools are available on the market for organizations that don’t need to build custom tools. This section will review a few standard categories of essential tools for the incident management process.
Alert Management
Alert management helps set up alerts and monitor anomalies in time series metrics over a certain period. It sends notifications to on-call personnel to inform them of the abnormality detected in the operational metrics. Alert management tools can be configured to escalate alerts to on-call engineers through multiple channels: a page or phone call for critical alerts, and messages or email for non-critical ones.
Alert management tools should support different notification channels and interoperate with observability tools such as Prometheus, Datadog, New Relic, Splunk, and Chronosphere. Grafana Alert Manager is an open-source alert management tool; PagerDuty, Opsgenie, and FireHydrant are some of the commercial alert management tools available on the market.
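The two ideas in this section, detecting a sustained anomaly and routing the notification by criticality, can be sketched in a few lines. This is a minimal illustration and not how any particular product works: the function names, channel names, and thresholds are all hypothetical.

```python
def breaches(samples, threshold, min_consecutive):
    """Return True if `samples` stay above `threshold` for at least
    `min_consecutive` points in a row (a simple sustained-anomaly check)."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical channel names; a real tool exposes its own integrations.
ROUTES = {
    "critical": ["pager", "phone_call"],
    "non_critical": ["chat_message", "email"],
}


def channels_for(level):
    """Return the notification channels configured for an alert level."""
    return ROUTES[level]


# Example: error-rate samples (percent) that stay above 5% for three points.
error_rate = [1.0, 9.0, 9.5, 8.7, 1.2]
if breaches(error_rate, threshold=5.0, min_consecutive=3):
    targets = channels_for("critical")
```

Requiring several consecutive breaches is one common way to avoid paging on a single noisy data point.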
On-Call Management
In a large organization with thousands of engineers and microservices, engaging the correct person in a reasonable amount of time is crucial for resolving incidents faster. On-call management tools help share on-call responsibilities across teams through on-call scheduling, escalation features, and service-to-on-call-engineer mappings that enable seamless collaboration during large-scale critical incidents.
On-call management tools should support customization of scheduling and service ownership details. PagerDuty and Splunk On-Call are some of the most well-known commercial options, whereas LinkedIn’s OnCall tool is an open-source alternative for organizations looking for budget options.
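A service-to-on-call mapping with an escalation chain can be modeled very simply. The sketch below is an assumption-laden illustration, not the data model of any of the tools named above; the service name and responder labels are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class EscalationPolicy:
    service: str
    # Ordered from first responder to last resort; names are placeholders.
    chain: list[str] = field(default_factory=list)

    def responder(self, unacknowledged_pages: int) -> str:
        """Pick who to page next, given how many pages went unacknowledged."""
        if not self.chain:
            raise ValueError(f"no on-call chain configured for {self.service}")
        # Clamp to the last entry so escalation never runs off the end.
        index = min(unacknowledged_pages, len(self.chain) - 1)
        return self.chain[index]


# Hypothetical mapping from service name to its escalation policy.
POLICIES = {
    "checkout": EscalationPolicy(
        "checkout",
        ["primary-oncall", "secondary-oncall", "engineering-manager"],
    ),
}
```

For example, `POLICIES["checkout"].responder(1)` selects the secondary on-call after the first page goes unanswered.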
Collaboration Tools
It is not uncommon to have hundreds of employees engaged during critical incidents. Collaboration and communication are essential to manage chaos and effectively resolve incidents. These days, every software company has messaging or video conferencing software that engineers can readily use to hop on a bridge and collaborate. Quick access to information about which chat group or video bridge to join is critical in reducing the time to resolve incidents.
A separate channel for every incident discussion is vital to enable easier collaboration. Bridge links are usually pinned to the group chat’s description for new engineers to join the meeting. A well-established process reduces the noise of logistical questions such as “where should I join” or “can someone please share the bridge link” in the group chat and keeps the communication channel clear for troubleshooting.
Incident Tracking
Incidents generate vast amounts of critical data, captured either by automated processes or by manual scribing, that is kept for future reference. Classic note-taking applications fall short due to a lack of structure. A ticketing platform that supports multiple custom fields and collaboration is a good fit. An API to fetch historical incident data is crucial.
Atlassian’s JIRA is used by many companies for all incident tracking, but similar tools such as Notion, Airtable, and Coda work equally well. Bugzilla is an open-source alternative that can help with incident tracking.
Knowledge Sharing
Knowledge-sharing tools are essential for engineers to find the correct information with ease. Runbooks, service information, post-mortem documents, and to-dos all belong in knowledge-sharing applications. Google Docs, wikis, and Notion are all good tools for capturing and sharing knowledge within the organization.
Status Pages
Status pages are a medium to easily broadcast the current status of the service health to outside stakeholders. Interested parties can subscribe to the updates to know more about the incident’s progress. Status pages reduce inbound requests to customer service departments regarding the system’s health when an external incident occurs.
Incident Response Lifecycle
In the previous sections, we discussed the different stages, roles, and tools in incident management. This section uses that information to detail the stages of the incident response process.
Detect
Issues are detected by internal monitoring systems or by user reports via customer support or social media. It is not uncommon for internal employees to see the issue first and escalate it to the centralized site operations team. Organizations should adopt reasonable observability solutions to detect problems faster so that Time To Detect (TTD) metrics are as small as possible.
In case of user escalations, a process should be implemented for employees to quickly escalate the issues to the relevant teams using the available on-call management tools. Escalation of issues marks the beginning of the incident management lifecycle.
Create
The team collects the required information about the incidents and creates an incident tracking ticket. Additional details about affected products, start time, impacted users, and other information that may help engineers troubleshoot should also be captured.
Once the ticket is created, the on-call Incident Manager needs to be engaged using the internal incident management tool. A shared channel for communications in the internal messaging service and a video bridge for easy collaboration should be started.
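The information captured at this stage can be thought of as a small structured record. The following is a minimal sketch of such a record, assuming hypothetical field names; a real ticketing platform would define its own schema and custom fields.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentTicket:
    # Field names are illustrative, not taken from any specific tool.
    title: str
    start_time: datetime
    affected_products: list[str]
    impacted_users: str             # free-form estimate, e.g. "EU users"
    severity: str = "unclassified"  # set later by the Incident Manager
    chat_channel: str = ""          # shared channel for the incident
    bridge_link: str = ""           # video bridge for collaboration
    notes: list[str] = field(default_factory=list)

    def add_note(self, text: str) -> None:
        """Append a UTC-timestamped note so the timeline can be reviewed later."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.notes.append(f"{stamp} {text}")


ticket = IncidentTicket(
    title="Users cannot send direct messages",
    start_time=datetime(2023, 1, 10, 9, 0, tzinfo=timezone.utc),
    affected_products=["messaging"],
    impacted_users="users in one region",
)
ticket.add_note("Incident Manager engaged; bridge started.")
```

Timestamping every note at write time makes it much easier to reconstruct the incident timeline during the review.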
Analyze
The Incident Manager works with the team to identify the on-call engineers for impacted services and collaborates with them to better understand the user impact. Based on the impact, the Incident Manager classifies the incident into major, medium, or minor. Major incidents are critical and would typically be an all-hands-on-deck situation.
Troubleshoot
Once the issue is classified as major, a preliminary incident communication is sent to all relevant stakeholders, stating that a major incident has been declared and noting the available information about the incident. This initial communication lacks details but should provide sufficient context for recipients to make sense of the outage. The external status page should be updated to acknowledge that an issue is in progress and that the organization is working on resolving it.
The Incident Manager should escalate the issue and engage all relevant on-call engineers based on the best available information. The communications lead will take care of the communications, and the customer escalation manager should keep the customers updated with any new information. The incident tracking ticket should capture all necessary incident tracking data.
If more teams are required, the Incident Manager should engage the respective teams until all the people needed to resolve the incident are present.
Resolution
Teams should focus first on mitigating the incident; finding the root cause and the permanent resolution can come later. For example, when an outage is limited to one region, teams can explore redirecting all traffic from the affected region to healthy regions to mitigate the issue. Mitigating the incident using any temporary means can help reduce the TTM (Time To Mitigate) of the incident and provide much-needed space for engineers to fix the root cause.
Throughout the troubleshooting process, detailed notes are maintained on things identified that may need to be fixed later, problems encountered during debugging, and process inefficiencies. Once the issue is resolved, the temporary mitigation steps are removed, and the system is brought to its healthy state.
Communications are updated with the issue identified, details on steps taken to resolve the problems, and possible next steps. Customers are then updated on the resolution.
Post-Incident Review
Once the root cause is identified, a detailed incident document is written with all the details captured during the incident. All stakeholders and the team participating in the incident management get together and conduct a blameless post-mortem. This review session aims to reflect on the incident and identify any technology or process opportunities to help mitigate issues sooner and prevent a repeat of similar incidents. The timeline of the incidents needs to be adequately reviewed to uncover any inefficiencies in the detection or incident management process. All the necessary action items are identified and assigned to the respective owners with the correct priorities. The immediate high-priority action items should be addressed as soon as possible, and the remaining lower-priority items must have a due date. A designated person can help track these action items and ensure their completion by holding teams accountable.
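Tracking action items to completion can start with something as small as a list of records with owners and due dates. The sketch below assumes a hypothetical `ActionItem` shape and priority labels; it only illustrates how a designated tracker might surface overdue items.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    summary: str
    owner: str
    priority: str  # e.g. "high" or "low"; labels are illustrative
    due: date
    done: bool = False


def overdue(items, today):
    """Return open action items whose due date has passed, oldest first."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)


# Invented example data.
items = [
    ActionItem("Add alert on queue depth", "team-a", "high", date(2023, 2, 1)),
    ActionItem("Update runbook", "team-b", "low", date(2023, 1, 15)),
    ActionItem("Fix retry storm", "team-a", "high", date(2023, 1, 20), done=True),
]
late_items = overdue(items, today=date(2023, 2, 10))
```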
Metrics to Measure
As it is said in SRE circles: “what gets measured gets fixed.” The following are standard metrics that should be measured and tracked across all incidents and organizations.
Time To Detect (TTD)
Time To Detect is the time from the start of an outage until it is detected, either manually or via automated alerts. Teams can adopt more comprehensive alert coverage with fresher signals to detect outages faster.
Time To Mitigate (TTM)
Time To Mitigate is the time taken to mitigate the user impact from the start of the incident. Mitigation steps are temporary solutions until the root cause of the issue is addressed. Striving for better TTM helps increase the availability of the service. Many companies rely on serving users from multiple regions in an active-active mode and redirecting traffic to healthy regions to mitigate incidents faster. Similarly, redundancy at the service or node level helps mitigate faster in some situations.
Time To Resolution (TTR)
Time to Resolution is the time taken to fully resolve the incident from the start of the incident. Time to Resolution helps better understand the organization’s ability to detect and fix root causes. As troubleshooting makes up a significant part of the resolution lifecycle, teams can adopt sophisticated observability tools to help engineers uncover root causes faster.
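All three metrics defined above are measured from the start of the incident, so they can be derived from four timestamps. The following sketch shows the arithmetic; the function name and the example timestamps are invented for illustration.

```python
from datetime import datetime, timedelta


def incident_durations(started, detected, mitigated, resolved):
    """Return TTD, TTM, and TTR, each measured from the start of the incident."""
    return {
        "ttd": detected - started,
        "ttm": mitigated - started,
        "ttr": resolved - started,
    }


# Example timestamps are invented for illustration.
start = datetime(2023, 1, 10, 9, 0)
durations = incident_durations(
    started=start,
    detected=start + timedelta(minutes=6),
    mitigated=start + timedelta(minutes=40),
    resolved=start + timedelta(hours=5),
)
```

Recording the four timestamps on every incident ticket is enough to compute these metrics consistently across the organization.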
Key Incident Metadata
Incident metadata includes the number of incidents, root cause type, services impacted, root cause service, and detection method. It helps the organization determine the TBF (Time Between Failures); the goal is to increase the Mean Time Between Failures. Analyzing this metadata also helps identify the hot spots in the operational aspect of the organization.
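As a rough illustration, the mean time between failures can be estimated from incident start times alone. This sketch makes a simplifying assumption: it measures the gap between consecutive incident starts and ignores how long each outage lasted. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(starts):
    """Average gap between consecutive incident start times."""
    ordered = sorted(starts)
    if len(ordered) < 2:
        raise ValueError("need at least two incidents to compute a gap")
    # Pair each incident with the next one and measure the gap.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Invented incident start dates: gaps of 10 and 20 days.
mtbf = mean_time_between_failures(
    [datetime(2023, 1, 1), datetime(2023, 1, 11), datetime(2023, 1, 31)]
)
```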
Availability of Services
Service availability is the percentage of time a service is up over a given period. The availability metric is used as a quantitative measure of resiliency.
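The definition translates directly into a small calculation. The sketch below uses a hypothetical helper name and invented numbers: a 30-day period (43,200 minutes) with 43.2 minutes of downtime.

```python
def availability_pct(period_minutes, downtime_minutes):
    """Percentage of the period during which the service was up."""
    if period_minutes <= 0:
        raise ValueError("period must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


# 30 days = 43,200 minutes; 43.2 minutes of downtime is roughly 99.9% availability.
monthly = availability_pct(period_minutes=30 * 24 * 60, downtime_minutes=43.2)
```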
Conclusion
This article discussed the incident management process and showed how it can help organizations manage chaos and resolve incidents faster. Incident management frameworks come in various flavors, but the ideas presented here are generic enough to customize and adapt in organizations of any size.
Organizations planning to introduce the incident management framework can start small by collecting data about incidents. This data will reveal any inefficiencies in the current system and provide a baseline against which to measure the progress of the new incident management process. Once they have a better sense of the requirements, they can start with a basic framework that suits the organization’s size without creating additional overhead. As needed, they can introduce other steps or tools into the process.
Organizations looking to improve their current incident management process must take a deliberate test, measure, tweak, and repeat approach. The focus should be on identifying what’s broken in the current process, making incremental changes, and measuring the progress. Start small and build from there.