
Managing Incident Response Like A Pro

In a world where risks seemingly grow at an exponential rate, an engineering organization’s ability to detect and respond to incidents is critical to its survival and success. But how do we define an ‘incident’? And what does it mean to be ‘prepared’ for one? In this article, we hope to answer some of these fundamental questions, and more. We believe sharing this information is beneficial for engineering teams everywhere looking to improve resiliency, reduce incident severity/downtime, and increase confidence in their product.

The goal of this article is to outline incident response best practices collected and implemented at Alegion. Our response plan has been forged from input provided by industry experts, as well as plenty of hands-on experience managing incidents while growing our platform over the years.

Why Incident Response Matters

A good physical-world analogy for incident response is the work of first responders. Paramedics, firefighters, and police officers all fill critical roles in society. Their services are extremely time-sensitive: delays cost lives and money. They are specialists who practice frequently and can handle a variety of situations by following predefined game plans. There are different roles, and the type of incident dictates the form and shape of the team that responds.

Fortunately, in information technology, most of us do not work with systems that have life-and-death criticality. That doesn’t mean we can’t learn something from first responders in how they organize, prepare, and handle problems.

Intended Audience

Anyone involved in ops, development, security, or product management can benefit from reading this article. As you will see, good incident response requires bidirectional support between different business functions – there is plenty of room for everyone to be involved and informed.

Whether your team is struggling to keep up with constant firefights, or if you already have a strong incident response game, there should be something here for everyone.

 

What Is An Incident?

The time to define what is and is not an incident is not when something anomalous is happening. That is too late. Take some time after reading this section to consider incident categories that apply to your business model. Then, share that distilled information with relevant stakeholders. An example document is linked below.

At a high level, there are three broad categories of incidents.

1. Regression Or Bug

A deployment has introduced an undesirable side effect.

2. Security Breach

Evidence exists of unauthorized access, misconfiguration, malicious activity, or other related concern.

3. Performance Degradation

Part of the infrastructure has become a bottleneck and is affecting end user experience. Alternatively, the platform is at risk of failing its service level objectives, potentially triggering complications with clients.

Example: Incident Checklist

One example of a document outlining types of incidents is Alegion’s incident triage checklist, which is designed around the information security CIA triad of confidentiality, integrity, and availability. Checking any box on the incident triage checklist is a strong indication that you are currently experiencing an incident.
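The "any checked box means incident" rule can be sketched in a few lines of Python. The three questions below are hypothetical placeholders organized around the CIA triad; they are not taken from Alegion's actual checklist, and the names `ChecklistItem`, `is_incident`, and `checked_categories` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    category: str  # "confidentiality", "integrity", or "availability"
    question: str


# Hypothetical questions for illustration only (assumption, not the real checklist).
CHECKLIST = [
    ChecklistItem("confidentiality", "Was data exposed to an unauthorized party?"),
    ChecklistItem("integrity", "Was data modified or deleted unexpectedly?"),
    ChecklistItem("availability", "Is a customer-facing service unreachable or degraded?"),
]


def is_incident(answers: dict) -> bool:
    """Treat any checked box as a strong indication of an incident."""
    return any(answers.get(item.question, False) for item in CHECKLIST)


def checked_categories(answers: dict) -> list:
    """Return the CIA categories with at least one checked box, sorted by name."""
    return sorted(
        {item.category for item in CHECKLIST if answers.get(item.question, False)}
    )
```

Unanswered questions default to unchecked, so a partially filled checklist can still be evaluated while triage is in progress.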

 

Create An Incident Response Plan

Typically, if your company is in compliance with SOC or a similar security auditing regime, you’ll already have an incident response plan in place. If so, review it and ensure the plan is relevant and up to date.

If no incident response plan is in place, do some research and create one as soon as possible. Below are critical elements of a successful incident response plan.

1. Defined Response Steps

Everyone involved should be able to tell at a glance where the response stands. One way to achieve that is to define stages of the response, so that naming the current stage quickly communicates its rough state.

  • Prepare – what you are doing right now.

  • Detect – logging, monitoring, observability.

  • Respond – targeted changes, hypothesized to fix the problem.

  • Recover – retro and develop action items.
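One way to make the stages above explicit in tooling is a small enumeration, sketched here in Python. The names `Stage`, `next_stage`, and `status_line` are assumptions made for this example, as is the idea that stages advance strictly in the listed order.

```python
from enum import Enum


class Stage(Enum):
    """Response stages, in the order they are listed above."""
    PREPARE = "Prepare"
    DETECT = "Detect"
    RESPOND = "Respond"
    RECOVER = "Recover"


def next_stage(stage: Stage) -> Stage:
    """Return the stage after `stage`; RECOVER is treated as the final stage."""
    order = list(Stage)  # Enum iteration preserves definition order
    idx = order.index(stage)
    return order[min(idx + 1, len(order) - 1)]


def status_line(incident_id: str, stage: Stage) -> str:
    """Build a one-line status that names the current stage."""
    return f"[{incident_id}] stage: {stage.value}"
```

A status such as `status_line("INC-1", Stage.DETECT)` gives readers the rough state of the response without further explanation.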

2. Responder Roles

Successful incident response requires defined roles for the responders. Role titles we have defined at Alegion include:

| Role | Description |
|------|-------------|
| Lead | Incident commander, and the individual ultimately accountable for response effectiveness. |
| Coordinator | Incident admin, responsible for fulfilling the needs of the response team. Provides supplies as needed (food, equipment, responder requests). |
| Responder | The individuals who are actively responding to the incident. |
| Scribe | Documents important timeline events, including decisions made and key players involved. |
| Communicator | Communicates to the rest of the org while the incident is ongoing. Communicators provide status updates through pre-defined communication channels (email, Slack, etc.), and state when to expect the next update. Communicators also field questions from the org in order to isolate the rest of the team and allow them to focus on the response. |

3. Prioritization

Not all incidents have the same severity or prioritization. Define priority levels for your org, and acceptable response times for the different priorities. You may not know the priority right away, so handle reports with appropriate urgency until triaging can occur. The table below is an example of defined priority levels:

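| Priority Level | Acknowledge | Triage | Remediate | Retro |
|----------------|-------------|--------|-----------|-------|
| P1 | 30 minutes | 1 hour | 24 hours | 36 hours |
| P2 | 4 hours | 12 hours | 3 days | 3 days |
| P3 | 24 hours | 24 hours | 7 days | 7 days |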

Defining priority levels is similar to, but not exactly like, defining service level objectives (SLOs), where a given metric is assigned a target over a predefined unit of time. An example SLO would be: availability (the metric) is at a minimum of 99.5% (the service objective) over a one-year period (the unit of time).

Read more about SLOs: https://en.wikipedia.org/wiki/Service-level_objective
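To make the SLO example concrete, the allowed downtime implied by an availability target can be computed directly. The sketch below is a minimal Python illustration; the function names are invented here, and a one-year period is assumed to be 365 days.

```python
from datetime import timedelta


def downtime_budget(availability_target: float, period: timedelta) -> timedelta:
    """Return the maximum downtime that still meets the availability target."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return period * (1.0 - availability_target)


def slo_met(observed_downtime: timedelta,
            availability_target: float,
            period: timedelta) -> bool:
    """True if the observed downtime fits within the budget for the period."""
    return observed_downtime <= downtime_budget(availability_target, period)


# 99.5% availability over 365 days leaves roughly 43.8 hours of downtime.
budget = downtime_budget(0.995, timedelta(days=365))
```

Seen this way, a 99.5% yearly target leaves a little under two days of total downtime, which is a useful number to keep next to the remediation times in the priority table.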

4. Procedures For Reporting And Escalating Reports

Document which channels to use for incident reporting. Typically, email is too slow for effective incident response. If this is true for you, say so in your policy and communicate that to your internal stakeholders. Require the use of real-time communication methods, such as Slack, a phone call, or an in-person report if co-located.

Any reported incident is then handled according to the priority levels and response times defined in the previous section.
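As a rough illustration, the priority table can be turned into concrete deadlines at the moment a report arrives. This Python sketch makes several assumptions that are not stated in the plan itself: every target is measured from the report time, the P3 triage value is read as 24 hours, and a report that has not been triaged yet is treated as P1. The names `RESPONSE_TARGETS`, `UNTRIAGED_PRIORITY`, and `deadlines` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response targets from the example priority table.
# Assumption: the P3 triage target is 24 hours.
RESPONSE_TARGETS = {
    "P1": {"acknowledge": timedelta(minutes=30), "triage": timedelta(hours=1),
           "remediate": timedelta(hours=24), "retro": timedelta(hours=36)},
    "P2": {"acknowledge": timedelta(hours=4), "triage": timedelta(hours=12),
           "remediate": timedelta(days=3), "retro": timedelta(days=3)},
    "P3": {"acknowledge": timedelta(hours=24), "triage": timedelta(hours=24),
           "remediate": timedelta(days=7), "retro": timedelta(days=7)},
}

# Assumption: until triage assigns a priority, use the most urgent targets.
UNTRIAGED_PRIORITY = "P1"


def deadlines(reported_at: datetime, priority: Optional[str] = None) -> dict:
    """Map each response step to its deadline, measured from the report time."""
    targets = RESPONSE_TARGETS[priority or UNTRIAGED_PRIORITY]
    return {step: reported_at + delta for step, delta in targets.items()}
```

For a report received at 09:00, `deadlines(reported_at, "P1")["acknowledge"]` falls at 09:30 the same day. Your own plan may measure some steps from a different starting point, such as measuring the retro from incident closure.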

Further Reading: Berkeley IR Guidelines

The Berkeley Information Security Office has a good guide for incident response planning, including a detailed list of components for incident response plans.

Conducting Incident Response

Make sure your plan includes how to actually conduct the response itself. This is crucial, because once a problem occurs many things typically happen at once, and events can quickly spiral out of control if the process isn’t defined and practiced.

Collect Evidence

When things are actively broken, it is tempting to shotgun possible solutions as fast as possible until service is restored. Avoid this temptation. Panic patching can easily do more harm than good, and it often does. Instead, have the response team use a private communication channel to triage, understand, and isolate anomalies. Share evidence, and wait to take action until consensus can be reached.

Communicate Proactively

An explanation of how and when communications occur should be part of the incident response plan. For example, the plan can dictate hourly updates to internal stakeholders in a dedicated Slack channel, #incident-command, for the duration of an incident. Posting in #incident-command should then be restricted, so that chatter happens on other channels or in threads. The important announcements must be easy to find.

The incident communicator should use the predefined communication channels to announce:

  • when an incident is declared

  • timely updates, including when to expect the next update

  • when an incident is closed
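The three announcement types can share one small formatter so that every open-incident message states when the next update is due. This is a sketch only: `format_update` and `UPDATE_INTERVAL` are invented names, the hourly interval mirrors the example above, and actually posting the text to a chat channel is left out.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(hours=1)  # hourly updates, as in the example above
KINDS = ("declared", "update", "closed")


def format_update(kind: str, summary: str, now: datetime,
                  interval: timedelta = UPDATE_INTERVAL) -> str:
    """Format an announcement; open incidents always state the next update time."""
    if kind not in KINDS:
        raise ValueError(f"kind must be one of {KINDS}")
    lines = [f"[{kind.upper()}] {summary}"]
    if kind != "closed":
        next_at = now + interval
        lines.append(f"Next update by {next_at:%H:%M}")
    return "\n".join(lines)
```

Because the "next update" line is produced by the formatter rather than typed by hand, the communicator cannot forget it under pressure.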

 

Proceed In An Orderly Fashion

Discuss privately amongst the incident response team what to do and assign one specific responder to apply patches. The scribe should monitor this private team communication channel in order to document what was changed, when, and what impact the change had, if any. The communicator should craft updates based on a distillation of activity from this channel.
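A scribe's record of what was changed, when, and with what impact can be kept in a very simple structure. The Python sketch below is one possible shape, with hypothetical names (`TimelineEntry`, `Timeline`) and fields chosen for this example rather than taken from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    at: datetime              # when the event or change happened
    actor: str                # who made the decision or applied the change
    action: str               # what was decided or changed
    impact: str = "unknown"   # observed effect, if any


@dataclass
class Timeline:
    entries: list = field(default_factory=list)

    def record(self, at: datetime, actor: str, action: str,
               impact: str = "unknown") -> None:
        """Append an entry; entries may be recorded out of order."""
        self.entries.append(TimelineEntry(at, actor, action, impact))

    def in_order(self) -> list:
        """Return entries sorted by time, oldest first."""
        return sorted(self.entries, key=lambda e: e.at)

    def render(self) -> str:
        """Render the timeline as one line per entry for the retrospective."""
        return "\n".join(
            f"{e.at:%Y-%m-%d %H:%M} {e.actor}: {e.action} (impact: {e.impact})"
            for e in self.in_order()
        )
```

Sorting on read rather than on write lets the scribe add entries after the fact, for example when a responder mentions a change made earlier.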

Incident Closure

Once the problem is fixed, we are done, right? Not exactly.

We still have a critical step to complete: a retrospective meeting needs to be conducted. Ideally, this should happen within days of incident closure, while the details are still fresh for everyone. The retrospective needs to cover a few key things:

  • Timeline – the scribe should retell the timeline, from the start of the anomaly, to its discovery, important decisions that were made, as well as any remediation steps that the response team took.

  • Blameless RCA – identify the underlying problem, as far as reasonably possible. One method to dig below the surface is asking the five whys, where you start with a high-level question, e.g. “why did <x> break?”, and each answer forms the basis of the next question. By the end of five iterations, practitioners should be closer to the root cause of an issue.

  • Action items – don’t stop at determining RCA. We must learn, improve, and avoid repeating mistakes. Create action items and assign them during the retrospective. Get a commitment on when to expect updates. These action items should be documented in the main ticketing system the rest of the company uses. At this point you have everything needed in order to follow up later and check on improvement progress.
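Following up on action items is easier when each one carries an owner and a committed update date. As a minimal sketch in Python, with hypothetical field names and ticket identifiers, overdue items can be listed like this:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    ticket: str            # identifier in the company ticketing system (hypothetical format)
    owner: str             # person who committed to the item
    description: str
    next_update_due: date  # date the owner committed to report progress
    done: bool = False


def needs_follow_up(items: list, today: date) -> list:
    """Return open items whose committed update date has already passed."""
    return [i for i in items if not i.done and i.next_update_due < today]
```

In practice the ticketing system would hold this data; the point of the sketch is that an owner and a due date are the minimum fields needed to check on progress later.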

Conclusion

As information security professionals, our customers and internal stakeholders place a lot of trust in us. We have an enormous job to be prepared for, well, just about anything. In order to be successful in this role we need every resource at our disposal, when it’s needed. This kind of organization just doesn’t happen organically. You need to plan, document, practice, and retro every incident for process improvement. I hope I’ve convinced you of the value in taking these extra steps. Your customers and coworkers will thank you when things break and you are able to confidently and swiftly navigate operations back to normal.

Resources

Incident response training provided by:

Alegion’s Incident Response Checklist: https://github.com/Alegion/incident-triage-checklist

Information Security CIA Triad: https://en.wikipedia.org/wiki/Information_security#Key_concepts

Service Level Objectives: https://en.wikipedia.org/wiki/Service-level_objective

Berkeley IR Planning Guidelines: https://security.berkeley.edu/incident-response-planning-guideline

Five Whys: https://en.wikipedia.org/wiki/Five_whys