What Makes a Perfect Incident Management Checklist? We Asked the Experts!
The perfect incident management checklist doesn’t need to be a fantasy. In fact, it shouldn’t be! A good checklist covers the essentials, is broken down into bite-size sections, and helps team members quickly identify the tasks that fall under their responsibility.
We asked our experts what should be included in the perfect incident management checklist. Here are their answers.
Don’t Overlook the Basics
An incident management checklist should always start at square one, and for most users, that’s an internet connection. According to our Senior Frontend Developer, you should always have these basics ready to go:
- A good headset
- A decent network connection
- Proper authentication and access permissions
- Awareness of what to check and where assets are located (e.g., links to logs, dashboards, and playbooks)
- Easily accessible scripts or dashboards that can quickly show high-level status
- Playbooks that define what to do in certain circumstances
Clear Roles and Responsibilities
During an urgent incident or crisis, you don’t want to spend time deciding who should be responsible for certain tasks. An incident management checklist that clearly outlines roles and responsibilities can be a huge time saver, but what else should be in this section? Our Engineering Team Lead suggests:
- An on-call stakeholder with the authority to make the decisions needed to reach resolution
- Communication leads to create internal and external messaging
- On-call subject matter experts for every incident
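As a rough illustration of how the roles above could be checked at the start of an incident, here is a minimal Python sketch. The role names and the `missing_roles` helper are assumptions made for this example, not part of any particular tool.

```python
# Hypothetical role labels based on the three roles listed above.
REQUIRED_ROLES = ("decision_maker", "communication_lead", "subject_matter_expert")


def missing_roles(assignments: dict[str, list[str]]) -> list[str]:
    """Return the required roles that currently have nobody assigned."""
    return [role for role in REQUIRED_ROLES if not assignments.get(role)]


# Example: one role filled, one present but empty, one missing entirely.
assignments = {"decision_maker": ["dana"], "communication_lead": []}
print(missing_roles(assignments))
```

Running a check like this when the incident is declared makes gaps in ownership visible before anyone starts debating who should do what.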
Actionable To-Do Items
Once the administrative work is covered and the right people are in the right seats, it’s time to begin the to-dos. Specific actions may be dependent on the incident itself, but almost every incident management checklist requires the following actions, outlined by our Team Lead and Senior UI Developer:
- Role assignment
- Inclusion of needed experts and stakeholders
- Communication of the incident status to the customer
- Inclusion of standard failover questions at certain timeframes (e.g., at 15 minutes, should we fail over?)
- Summarization of the incident after the issue has been mitigated
- Postmortem scheduling
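The timed failover question lends itself to a small helper. The sketch below is one possible approach, assuming a hypothetical `PROMPTS` schedule; only the 15-minute failover question comes from the list above, and the 30-minute entry is an invented placeholder.

```python
from datetime import datetime, timedelta

# Hypothetical schedule: minutes since incident start -> question to raise.
# Only the 15-minute entry comes from the checklist; the other is illustrative.
PROMPTS = {
    15: "Should we fail over?",
    30: "Do we need to bring in more experts?",
}


def due_prompts(started_at, now, already_asked=frozenset(), prompts=PROMPTS):
    """Return (minute, question) pairs that are due and not yet asked."""
    elapsed_minutes = (now - started_at) / timedelta(minutes=1)
    return [
        (minute, question)
        for minute, question in sorted(prompts.items())
        if elapsed_minutes >= minute and minute not in already_asked
    ]


start = datetime(2024, 1, 1, 12, 0)
print(due_prompts(start, start + timedelta(minutes=16)))
```

Tracking which prompts have already been asked keeps the same question from being raised again on every check.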
The Specific Specifics
Whether it’s an attachment to your incident management checklist or page two, the during-incident specifics need consideration: this includes note-taking, root cause identification, and so much more. Our experts suggest identifying and recording:
- The date and time of the incident
- Applicable first responders
- Systems at fault
- Incident severity level
- Incident blast radius
- Estimated resolution date and time
- Planned communication rollout
- Whether the incident is a repeat (if so, refer to previous incidents)
- Whether multiple incident reports were filed (if so, group them)
- A current snapshot of health check systems, relevant monitoring, and logging systems
- Possible playbooks that can revert systems
- Code and deployment issues
- The timeline of events starting from the discovery of the incident to the resolution
- Relevant postmortem information (e.g., whether a postmortem is planned, and whether its meeting minutes and action items are attached so preventative steps can be taken in the future)
- An updated service track record with the incident attached, and the "days without incident" counter reset to zero
- Confirmation that stakeholders were updated as per the communication rollout plan
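To make the list above more concrete, here is a minimal Python sketch of an incident record that captures a subset of these details. The `IncidentRecord` class, its field names, and the `unfilled` helper are assumptions for illustration; a real team would adapt them to its own tooling.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime
from typing import Optional

# Fields that may legitimately stay empty (e.g., the incident is not a repeat).
OPTIONAL_FIELDS = {"related_incidents"}


@dataclass
class IncidentRecord:
    """Hypothetical record covering some of the details listed above."""
    started_at: Optional[datetime] = None
    first_responders: list[str] = field(default_factory=list)
    systems_at_fault: list[str] = field(default_factory=list)
    severity: Optional[str] = None
    blast_radius: Optional[str] = None
    estimated_resolution: Optional[datetime] = None
    related_incidents: list[str] = field(default_factory=list)
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    def log(self, when: datetime, note: str) -> None:
        """Add a timeline entry and keep the timeline in chronological order."""
        self.timeline.append((when, note))
        self.timeline.sort(key=lambda entry: entry[0])

    def unfilled(self) -> list[str]:
        """Return names of required fields that are still empty."""
        return [
            f.name
            for f in fields(self)
            if f.name not in OPTIONAL_FIELDS and getattr(self, f.name) in (None, [])
        ]


record = IncidentRecord(severity="SEV2", systems_at_fault=["checkout-api"])
record.log(datetime(2024, 1, 1, 12, 30), "Failed over to standby")
record.log(datetime(2024, 1, 1, 12, 0), "Alert fired")
print(record.unfilled())
```

Listing what is still unfilled gives the note-taker a quick way to see which details need to be captured before the postmortem.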
Incident management checklists are always a work in progress. However, something that should always be part of your incident management process is a service reliability platform capable of helping you automate your response, integrate your tool stack, and accelerate your entire incident management process. Try xMatters for free and learn how xMatters can help.