Instead of Dreading a System Crash, Schedule One and Learn to Avoid Them

Oct. 10, 2014, 8:45 PM UTC / Source: Entrepreneur.com

By Nisha Ahluwalia

According to a survey by CA Technologies, companies in North America and Europe lost more than $26.5 billion in revenue to downtime, and that figure dates from 2010.

There are various ways to calculate the monetary cost of system outages, but the damage to a company’s reputation is immeasurable. When Microsoft’s Azure cloud-computing service experienced a major outage recently, experts speculated that it could be a major blow to the software giant’s attempt to compete against rivals Google and Amazon.

Good CEOs and CIOs refuse to accept excuses for even small levels of downtime, but it’s not easy to hit five nines (99.999 percent) of reliability. Nonetheless, no matter how complex a company’s systems and business, there are always ways to engineer and deliver higher reliability and quality of service. Below are the actions that CEOs need to take to boost their company’s reliability:

1. Stop waiting for an outage. Create one.

If you wait for a customer to do something that causes a failure, you’re too late. Netflix, for example, has tackled unexpected outages with its “Simian Army,” a set of automated tools that test applications for failure resilience. For most companies, however, the best approach is to keep it simple.

Encourage your ops and dev teams to schedule a recurring meeting and create outages manually. Injecting failure reveals implementation issues that reduce resiliency while proactively uncovering deficiencies that would otherwise be the root cause of an outage.
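For teams that want to rehearse before a live drill, the idea of injecting failure can be sketched in a few lines of Python. This is a minimal illustration, not Netflix's tooling or any particular product: `fail_first_calls`, `call_with_retry` and `lookup_price` are hypothetical names invented for the example, and the "dependency" is a stand-in function.

```python
class InjectedFailure(Exception):
    """Raised on purpose during a scheduled failure drill."""


def fail_first_calls(func, failures_to_inject):
    """Wrap func so its first `failures_to_inject` calls fail on purpose."""
    remaining = failures_to_inject

    def wrapper(*args, **kwargs):
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise InjectedFailure(f"drill: forced failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def call_with_retry(func, attempts):
    """Call func up to `attempts` times; return (result, failures_seen)."""
    failures = 0
    for _ in range(attempts):
        try:
            return func(), failures
        except InjectedFailure:
            failures += 1
    raise RuntimeError(f"gave up after {failures} injected failures")


def lookup_price():
    # Hypothetical stand-in for a call to a real dependency.
    return 42


# Drill: force two failures, then check that the caller recovers.
flaky_lookup = fail_first_calls(lookup_price, failures_to_inject=2)
result, failures = call_with_retry(flaky_lookup, attempts=3)
print(f"result={result} after {failures} injected failure(s)")
```

If the number of injected failures is raised to match the number of attempts, the retry logic gives up. That is the kind of weakness a scheduled outage is meant to expose before a customer does.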

Scheduled outages build a strong collaborative culture simply by bringing teams together on a regular basis. Working together to fix artificial failures will combat the idea that an actual failure can be ignored or justified with explanations.

2. Create (and protect) time for learning

No good engineer fixes the same problems without learning in the process. Make sure the teams responsible for resolving incidents have time to work through comprehensive postmortems.

Empower your teams to analyze what worked and what didn’t, without forcing them to determine a root cause. All too often, human error is the focus of these conversations, and that just isn’t healthy. Blameless retrospectives allow teams to uncover the real issues and make proactive adjustments.

Businesses want to move fast, but resist the temptation to move on to other issues as soon as systems resume running or everyone agrees on a “root cause.” Invest the time needed to understand how your systems and teams work. See it as an opportunity for the contextual learning needed to make real-time decisions that will improve your company’s mean time to resolution.
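Mean time to resolution is simple to track once incidents are logged with timestamps. The sketch below assumes a hypothetical log of (detected, resolved) pairs; the incident data and the function name are made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2014, 9, 1, 10, 0), datetime(2014, 9, 1, 10, 45)),    # 45 minutes
    (datetime(2014, 9, 12, 2, 30), datetime(2014, 9, 12, 4, 0)),    # 90 minutes
    (datetime(2014, 9, 20, 16, 15), datetime(2014, 9, 20, 16, 30)), # 15 minutes
]


def mean_time_to_resolution(records):
    """Average time between detection and resolution across incidents."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in records),
        timedelta(),
    )
    return total / len(records)


mttr = mean_time_to_resolution(incidents)
print(f"Mean time to resolution: {mttr}")  # prints "Mean time to resolution: 0:50:00"
```

Tracking this number from one retrospective to the next is one way to tell whether the time invested in learning is paying off.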

3. Treat your ops and dev teams like sales and marketing. They drive revenue.

If you didn’t support your sales teams with tools, training and incentives to hit their goals, people would think you were nuts. Despite their critical role in ensuring your customers are getting value from your company, ops and dev teams often get less attention than their customer-facing counterparts.

Give these employees the infrastructure and tools to achieve peak performance. That includes the latest operations management tools, time and resources for training, and goals with incentives to meet them. If you don’t provide them with the necessary support and recognition, how can you expect them to deliver a high-value product with high availability?

4. Set a high bar for uptime

Even short periods of downtime have a material impact on your bottom line and market perception. Once you’re committed to supporting your engineering teams, though, you’re in a much better position to set a higher bar for uptime. Build, buy or partner to get the technology and skill sets you need.
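To see how high the “five nines” bar mentioned earlier really is, it helps to turn an uptime target into a yearly downtime budget. The following is a back-of-the-envelope sketch; it assumes a 365-day year, and the function name is invented for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {budget:.1f} minutes of downtime per year")
```

At 99.9 percent, a company can be down for roughly 526 minutes, or almost nine hours, a year. At five nines the budget shrinks to a little over five minutes.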

Unfortunately, many companies still use homegrown operations management systems without redundancy, and still use disparate tools and manual processes to meander through the incident lifecycle. A focus on reducing ops team costs instead of setting the right culture from the start simply doesn’t make sense. The time spent on fixes alone will quickly become a greater cost for your company. Your product and services will suffer as a result.

CEOs who understand the importance of reliability in today’s always-on world don’t wait until there’s an outage to improve operations. They don’t ignore the rich learning that comes from resolving incidents. They don’t treat operations and development teams like the “back office.” The CEOs of highly reliable companies invest in their operations infrastructure, processes and people because they care about the growth of their business and the loyalty of their customers.