Background
During the year before the intervention, the company's infrastructure was providing product availability of about 99.78%. Although this level of availability appears quite acceptable, even admirable, there are about 500,000 minutes in a year, so a 0.22% failure rate means that a product at this level is unavailable for well over an hour a month on average. Many of the products were used globally and continuously, so this level was determined to be unacceptable. When previous leadership was asked to commit to an availability target, the response was 97%, which was not only far below the actuals but also far below a reasonable level of acceptability.
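The downtime arithmetic above can be checked with a minimal sketch. The function name `downtime_minutes` and the constant `MINUTES_PER_YEAR` are illustrative assumptions, not part of any tooling the company used; the sketch uses the exact 525,600 minutes in a 365-day year rather than the rounded 500,000.

```python
# Illustrative only: names below are assumptions made for this example.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; the text rounds this to about 500,000


def downtime_minutes(availability_pct: float,
                     period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the expected minutes of unavailability over a period."""
    return (1 - availability_pct / 100) * period_minutes


if __name__ == "__main__":
    # 97% was the target offered by previous leadership; 99.78% was the actual level.
    for pct in (97.0, 99.78):
        per_year = downtime_minutes(pct)
        print(f"{pct}% -> {per_year:,.0f} min/year, {per_year / 12:,.0f} min/month")
```

At 99.78% this works out to roughly 1,156 minutes a year, or about 96 minutes a month. The 97% target would have permitted about 1,314 minutes a month, which is close to 22 hours.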
There was some apparent good news: the infrastructure team had implemented a process by which failures were investigated and the results were placed in a “root cause” repository.
Decisions Made in the First 100 Days
On the first day, I stated a new cultural expectation for the entire IT organization involved in delivering and supporting products: “There is only one standard of performance - Flawless Execution.” I went on to say, “There is no Murphy and no Murphy's Law, and the notion that Sh*t happens has no place in our organization.”
When I personally reviewed the repository containing the root cause analyses, I found that more than 60% contained absolutely no valuable insight as to what failed, why it failed, and what needed to be done to correct the root cause. Most of the entries were along the lines of, “Rebooting the server solved the problem and restored the system.” This was incredibly disappointing because it meant that there was very little useful information on which to build improvement plans.
At that point, a decision was made to assign two of the organization's best resources to investigate every product outage from that point forward. Each investigation went into extreme detail to determine exactly what had happened and why. It also documented what the root cause(s) were, what needed to be done to fix them, and when they would be corrected. One of the two resources was selected for his detailed understanding of the company's infrastructure and his intense focus. In fact, his nickname became “The Bulldog.” The other resource was selected for his understanding of strategic technology and what needed to be implemented to prevent further occurrences. We called him the “Strategist.” The Bulldog was assigned to the Strategist, who began reporting directly to me.
These two resources began to have an immediate and positive effect, even though there was no significant decrease in the number or duration of the disruptions. The effect was an acknowledgement by the organization that it had indeed changed and that the commitment to improving product availability was extremely serious. The concepts of “availability” and “resiliency” were also refined with the observation that “Availability is always ‘Plan A.’” Plan B, however, is to create product environments that are as “resilient” as possible so that if a product fails, its functionality is restored as soon as possible. Improved availability and much greater resiliency became the twofold manifestation of the new initiative.
In addition to the immediate focus on understanding the root causes of every outage and degradation, a four-step plan for significantly improving product availability and resiliency was developed. The steps were:
The first product to undergo this process had revenues in the tens of millions of dollars per year and was selected because it had experienced an 8-hour outage before the Strategist and the Bulldog were put in place. Although the four steps took about 5 months and about $500,000 to complete, the product has operated with 100% availability for the entire period since the new design was implemented.
In the first year after the intervention, the availability of the Top 15 products was 99.93%. In the next year, the number of products tracked was increased to 130 and the availability was greater than the target of 99.95%. In the third year, the number of products tracked was increased to 150.
Fortunately, the cultural affirmation of “flawless execution” continues to influence the IT organization's behavior.
Copyright © 2019 DMC Companies - All Rights Reserved.
949-872-3560 Dave@MgmtTechSolutions.com