Six Recommendations for Resolving a Major IT Incident
Your operations manager has discovered an anomaly in your security system. The business will start to suffer within 15 minutes if it is a major IT incident. What should she do? We have some definite recommendations for managing major incidents.
1. Define a Major Incident
Before your operations manager can determine whether the incident is critical, she has to have a definition for comparison. There is no official definition, so your organization has to have its own. ITIL recommends using three criteria:
• Urgency: Effect on important business deadlines
• Impact: Effect on the business’s finances, reputation, and viability
• Severity: Effect on end users, including employees and customers
Share the definition with your operations managers and major incident managers, and put them through training and practice drills so they’re ready when they’re under pressure.
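As an illustration, the three criteria above could be captured in a small scoring helper so that everyone applies the same definition. This is only a sketch: the 1 to 3 scale, the threshold of 7, and the names `IncidentAssessment` and `is_major_incident` are assumptions made for the example, not part of ITIL or any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentAssessment:
    """Scores for the three criteria on an assumed 1 (low) to 3 (high) scale."""
    urgency: int   # effect on important business deadlines
    impact: int    # effect on finances, reputation, and viability
    severity: int  # effect on end users


def is_major_incident(assessment: IncidentAssessment, threshold: int = 7) -> bool:
    """Return True when the combined score reaches the (assumed) threshold."""
    scores = (assessment.urgency, assessment.impact, assessment.severity)
    for value in scores:
        if not 1 <= value <= 3:
            raise ValueError("each criterion must be scored 1, 2, or 3")
    return sum(scores) >= threshold


# Example: high urgency and impact, moderate severity.
print(is_major_incident(IncidentAssessment(urgency=3, impact=3, severity=2)))
```

Your organization would replace the scale and threshold with whatever its own definition calls for; the point is that the rule is written down and applied the same way every time.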
Bottom line: If you don’t define a major incident, you’re setting up your resolution team for failure.
2. Establish Incident Processes
So your operations manager determines that it is a major incident. Now what? If she has to decide what the next step is, those 15 minutes to business impact will tick away very quickly.
Establish a very clear process, and share it with everyone who might have to use it. Key ingredients include:
• Establish resolution processes for less critical issues so your major incident managers don’t receive false alarms.
• Encourage all people who touch the resolution process to update their contact information in xMatters, and set up groups with on-call and vacation schedules.
• Automate who should receive alerts, and practice the resolution process so everyone knows what’s happening at each stage in the process.
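To make the idea of automated routing concrete, here is a minimal sketch of choosing alert recipients from a group based on on-call days and vacation dates. It is plain Python and does not use the xMatters API; the `Responder` class, its fields, and `recipients_for` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Responder:
    name: str
    on_call_weekdays: set[int]  # Monday = 0 ... Sunday = 6
    vacations: list[tuple[date, date]] = field(default_factory=list)

    def is_available(self, day: date) -> bool:
        """On call that weekday and not inside any vacation range (inclusive)."""
        on_vacation = any(start <= day <= end for start, end in self.vacations)
        return day.weekday() in self.on_call_weekdays and not on_vacation


def recipients_for(group: list[Responder], day: date) -> list[str]:
    """Names of group members who should receive alerts on the given day."""
    return [r.name for r in group if r.is_available(day)]


group = [
    Responder("Ana", {0, 1}),
    Responder("Ben", {0}, [(date(2023, 12, 30), date(2024, 1, 2))]),
]
# 1 January 2024 is a Monday; Ben is on vacation, so only Ana is alerted.
print(recipients_for(group, date(2024, 1, 1)))
```

The sketch only works if the schedule data is current, which is why keeping contact information and vacation dates up to date matters so much.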
Bottom line: Establish a clear process and make sure everyone knows how to execute it.
3. Be Transparent with Closed-Loop Integrations
NBN reduced follow-up calls to the service desk by 75% by proactively communicating to customers and other key stakeholders. – NBN Case Study
Businesses like to showcase a high number of integrations, but sometimes the depth matters more than the number. When ServiceNow sends an alert, xMatters automatically logs all the communication events in the originating ServiceNow system. So no matter who is reviewing logs, everything that happened is transparent. This helps both during the incident and in the post-mortem afterward.
Companies that try to hide service disruptions usually pay a price in reputation, customer satisfaction, and revenue. Communicate proactively to customers and executives, but customize your communications. A notification to an engineer might say, “Server SR-DB2a is down,” but the same notification to a customer might say, “We’ve had a mechanical failure.”
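One simple way to customize communications is to keep a separate message template per audience and fill in the details only where that audience needs them. The sketch below assumes two audiences and reuses the example wording above; `TEMPLATES` and `render_notification` are hypothetical names, not part of any product.

```python
# Assumed audiences and wording; adapt these to your own stakeholders.
TEMPLATES = {
    "engineer": "Server {server} is down.",
    "customer": "We've had a mechanical failure.",
}


def render_notification(audience: str, **details: str) -> str:
    """Fill the audience's template; details a template does not use are ignored."""
    try:
        template = TEMPLATES[audience]
    except KeyError:
        raise ValueError(f"no template defined for audience {audience!r}") from None
    return template.format(**details)


print(render_notification("engineer", server="SR-DB2a"))
print(render_notification("customer", server="SR-DB2a"))
```

The same incident details go in, but each audience sees only the level of detail that is useful to them.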
Proactive communications will limit the number of inbound calls from customers, saving time and money.
Bottom line: Let resolvers resolve.
4. Use Effective Communication Channels
A lot of companies rely on email and SMS. Emails arrive in overwhelming volumes and get buried in subfolders. SMS messages can be disrupted by Wi-Fi limitations and character limits.
We recommend pushing notifications through the xMatters app to ensure delivery. We also recommend sending notifications appropriate for the receiving device. Limit your character count for text messages, and watch who you’re copying on emails. Avoid mass notifications, which involve too many people and use the same message for everyone (see number 3 above).
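As a rough sketch of fitting a notification to the receiving device, the helper below shortens a message when the channel has a character limit. The 160-character figure is the standard length of a single-segment SMS; treating push and email as unlimited, along with the names `CHANNEL_LIMITS` and `fit_to_channel`, are assumptions for the example.

```python
from typing import Optional

# Character limits per channel; None means no limit is enforced here (assumption).
CHANNEL_LIMITS: dict[str, Optional[int]] = {
    "sms": 160,
    "push": None,
    "email": None,
}


def fit_to_channel(message: str, channel: str) -> str:
    """Return the message, truncated with '...' if it exceeds the channel limit."""
    limit = CHANNEL_LIMITS[channel]
    if limit is None or len(message) <= limit:
        return message
    return message[: limit - 3].rstrip() + "..."


long_update = "Database cluster failover in progress. " * 6
print(len(fit_to_channel(long_update, "sms")))
```

In practice it is usually better to write a short message for text channels than to truncate a long one, but a guard like this keeps an overlong message from being split or cut off unpredictably.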
When you assemble a resolution team on a conference bridge, we recommend the xMatters one-click conference bridge. You can target notifications to just the team members, make joining fast and easy, and maintain metrics on who joined on what device.
Bottom line: Targeted notifications can save time and keep the process on track.
5. Make Device Usability Count
Rows of engineers on laptops in a NOC is so 2005. Today your resolution team members must be able to target individuals, groups, their on-call and vacation schedules, devices, messages, and content from any device. Usability on any device saves time and enables everyone to contribute.
Bottom line: Major incident resolution relies on mobility. Enable any device to contribute.
6. Measure and Improve (Always)
Your client’s project is never complete. You can always improve response times at crucial junctures in the major incident process. For example, when your monitoring solution discovers a potential critical incident, is there a lag before your operations manager receives it and notifies a major incident manager? Can you shave even a few minutes off the time to initial notification, time to response, or time to assemble a team? It could be the difference between meeting SLAs and missing them.
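The three timings mentioned above are easy to compute once each incident records a few timestamps. The sketch below is one possible way to aggregate them across incidents. The `IncidentTimeline` fields and the choice to measure each interval from the previous milestone are assumptions for the example; your own tooling may define the intervals differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentTimeline:
    detected: datetime        # monitoring flags the potential incident
    notified: datetime        # major incident manager is notified
    responded: datetime       # first responder acknowledges
    team_assembled: datetime  # resolution team is on the bridge


def intervals(t: IncidentTimeline) -> dict[str, timedelta]:
    """Each interval is measured from the previous milestone (an assumption)."""
    return {
        "time_to_initial_notification": t.notified - t.detected,
        "time_to_response": t.responded - t.notified,
        "time_to_assemble_team": t.team_assembled - t.responded,
    }


def average_minutes(timelines: list[IncidentTimeline]) -> dict[str, float]:
    """Average each interval across incidents, in minutes."""
    if not timelines:
        raise ValueError("at least one incident timeline is required")
    per_incident = [intervals(t) for t in timelines]
    return {
        key: mean([p[key].total_seconds() for p in per_incident]) / 60
        for key in per_incident[0]
    }


t0 = datetime(2024, 1, 1, 9, 0)
m = timedelta(minutes=1)
history = [
    IncidentTimeline(t0, t0 + 2 * m, t0 + 5 * m, t0 + 11 * m),
    IncidentTimeline(t0, t0 + 4 * m, t0 + 9 * m, t0 + 13 * m),
]
print(average_minutes(history))
```

Running the aggregation over many incidents, rather than one at a time, is what shows which stage of the process is consistently the slowest.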
Bottom line: Businesses usually measure incident response in the immediate aftermath of an event. Run metrics more often to aggregate data and get a more holistic look at where you can improve.