Outage Response
Replacing Chaos with a Single Standard
Before: The Chaos
Before Brian Jolley standardized outage response at Western Governors University, there was no protocol. What happened during an outage depended on who noticed, who they knew, and who happened to be at their desk.
Detection was ad hoc. Someone on the problem management team might get a Slack message. Or an email. Or someone would notice on a ServiceNow dashboard that incoming calls had spiked. Sometimes it was a phone call, unless the phones were down. Occasionally alerts came through Salesforce tickets from departments that had entrenched themselves in their own separate ticketing systems.
Escalation was a guessing game. Without a formal org chart or configuration management database (CMDB), problem managers had to know from memory which Software Engineering Manager (SEM) was associated with each system. Some SEMs answered Slack. Some responded to email. Some you had to physically walk across the building to find, and if they weren’t at their desk, you had to figure out who was next in line.
Authority was inverted. SEMs outranked problem management agents. If an SEM disagreed that their system was the culprit, the problem manager had no mechanism to override that judgment — even while a major outage was underway.
Collaboration was improvised. There was no standard meeting point. Should someone start a WebEx? A Teams meeting? Every outage reinvented the coordination process from scratch.
Follow-up was inconsistent. No standardized metrics. No consistent tracking. Few post-incident reviews. The same failure modes repeated because nobody documented what happened or why.
After: One of Everything
Brian Jolley, Director of Service Management at Western Governors University, replaced the fragmented system with a single standard at every step.
One ticketing system. All outage-related tickets in ServiceNow. No more rogue department tickets in Salesforce.
One alert-and-acknowledge system. Automated alerts with a tracked acknowledgment requirement.
One escalation path. A defined ladder with time-bound intervals: SEM or first on rotation had 5 minutes to acknowledge, backup had 5 minutes, team had 5 minutes, director had 10 minutes, VP had 10 minutes, CIO had 10 minutes. Worst case, the alert reached the CIO at the 35-minute mark and the whole ladder was exhausted within 45 minutes. Nobody wanted to be the reason the CIO answered an alert. That single design choice was enough motivation for most teams to respond promptly.
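The ladder's timing can be sketched in a few lines of Python. This is an illustration, not the system WGU ran: it assumes each rung's window opens the moment the previous one expires unacknowledged, which is the reading that matches the 45-minute total. The names Rung, LADDER, schedule, and current_rung are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    role: str
    ack_window_min: int  # minutes this rung has to acknowledge before escalation


# Roles and windows as listed in the text. The sequential hand-off is an assumption.
LADDER = [
    Rung("SEM or first on rotation", 5),
    Rung("Backup", 5),
    Rung("Team", 5),
    Rung("Director", 10),
    Rung("VP", 10),
    Rung("CIO", 10),
]


def schedule(ladder):
    """Return (role, alerted_at, window_closes_at) in minutes since the first alert."""
    rows = []
    elapsed = 0
    for rung in ladder:
        rows.append((rung.role, elapsed, elapsed + rung.ack_window_min))
        elapsed += rung.ack_window_min
    return rows


def current_rung(ladder, minutes_unacknowledged):
    """Return the role being alerted at that moment, or None once the ladder is exhausted."""
    for role, start, end in schedule(ladder):
        if start <= minutes_unacknowledged < end:
            return role
    return None


if __name__ == "__main__":
    for role, start, end in schedule(LADDER):
        print(f"{role:<26} alerted at {start:>2} min, window closes at {end:>2} min")
```

Under that assumption, an alert ignored for 12 minutes is sitting with the team, and the CIO's window opens at minute 35 and closes at minute 45.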
One communication channel. A single official chat channel for outage mitigation. No more hunting for a conference room or debating WebEx vs. Teams.
One standard set of metrics. Response time decomposed into five components — not a single “time to fix” number, but five measurements that revealed where the bottleneck actually was:
- Time to detect the outage
- Time to alert the responsible team
- Time to acknowledge the alert
- Time to identify the issue
- Time to restore service
Each failure mode requires a different fix. A single “time to resolve” metric hides the diagnosis.
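A minimal Python sketch of the decomposition follows. It assumes each component is measured from the end of the previous stage, so the five durations add up to the total outage time; the source does not state how the intervals were bounded. OutageTimeline, decompose, and bottleneck are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class OutageTimeline:
    started: datetime       # when the outage actually began
    detected: datetime      # when someone or something noticed
    alerted: datetime       # when the responsible team was alerted
    acknowledged: datetime  # when the alert was acknowledged
    identified: datetime    # when the issue was identified
    restored: datetime      # when service was restored


def decompose(t: OutageTimeline) -> dict[str, timedelta]:
    """Split one outage into the five components (assumed to be consecutive)."""
    return {
        "detect": t.detected - t.started,
        "alert": t.alerted - t.detected,
        "acknowledge": t.acknowledged - t.alerted,
        "identify": t.identified - t.acknowledged,
        "restore": t.restored - t.identified,
    }


def bottleneck(t: OutageTimeline) -> str:
    """Name the component that consumed the most time in this outage."""
    parts = decompose(t)
    return max(parts, key=parts.get)
```

For an outage where detection took 40 minutes and every later stage took 30 minutes or less, bottleneck returns "detect", which points at monitoring, not at the responding team.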
One RCA process. Every outage got a documented root cause analysis with tracked action items and due dates — feeding directly into the Directors’ Closure Review.
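One way to picture an RCA record with tracked action items and due dates is the Python sketch below. The field names and the overdue check are assumptions for illustration; the source does not describe the actual ServiceNow records or what the Directors' Closure Review checked.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    done: bool = False


@dataclass
class RootCauseAnalysis:
    outage_id: str    # hypothetical identifier for the outage ticket
    root_cause: str
    action_items: list[ActionItem] = field(default_factory=list)

    def open_items(self) -> list[ActionItem]:
        """Action items that have not been completed yet."""
        return [a for a in self.action_items if not a.done]

    def overdue(self, today: date) -> list[ActionItem]:
        """Open action items whose due date has already passed."""
        return [a for a in self.open_items() if a.due < today]
```

A review meeting could then start from overdue(today) instead of from memory.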
One standard training module. Fire drills tested team readiness before real outages. The first time a team ran the protocol was not during an actual emergency.
The Results
Brian Jolley’s outage response standardization at Western Governors University, combined with the Scorecards system and Directors’ Closure Review, reduced high-severity outage duration from weeks to hours within four months. The protocol replaced a system where critical university services could be down for days with unclear ownership, no escalation authority, and no post-incident accountability.
Why It Worked
The elegance was the repetition. One of everything. When you’re in a crisis, you don’t want to think about which tool to use or which channel to join or who to call. The protocol eliminated every decision except the one that matters: what’s broken and how do we fix it.
Decomposed metrics exposed root causes. Breaking response time into five components meant leadership could see exactly where the process was failing and target fixes precisely. A monitoring gap is a different problem from a team that doesn’t answer alerts.
Fire drills built muscle memory. Teams practiced the protocol before they needed it. When a real outage hit, the response was a rehearsed process, not an improvised scramble.
The escalation ladder created urgency. The five- and ten-minute intervals and the CIO at the top of the chain meant that ignoring an alert was more uncomfortable than responding to it.
Connected Work
Outage Response is where Brian Jolley’s expertise in organizational fracturing was forged. The initiative connects directly to Scorecards (which used the mean time to acknowledge, or MTTA, metric that originated here) and Directors’ Closure Review (which consumed the RCA outputs). The ownership clarity that made the escalation ladder meaningful eventually came from the Common Services Data Model.