Outages, The Blame Game, and Solutions

Computerworld.com has a great article entitled “Who do you blame when IT breaks?” by Patrick Thibodeau.

The article discusses a recent study by the Uptime Institute that identified the parties responsible for data center outages across a medium-sized sample group. The largest contributor to outages, according to the report? Vendors. From the article:

External forces who work on the customer’s data center or supply equipment to it, including manufacturers, vendors, factory representatives, installers, integrators, and other third parties were responsible for 50% to 60% of the incidents reported in those years, according to Uptime.

That’s not to say that internal sources were not to blame. On the contrary, “34% of the abnormal incidents in 2009 were attributed to operations staff, followed by 41% in 2010, and 40% last year.” But for those who have worked in IT, a pattern seems to be emerging from the data.

Obviously, you absolutely must find good vendors if you’re going to be successful, and my book, Information Age Management: How to Increase Productivity by Getting the Best Out of Contractors, Vendors, Telecommuters, and Other Geeks in a Rapidly Changing Workplace, devotes time to finding and taming good vendors. But there’s another point worth noting, and it’s one I touch on in the appendix to my book. You see, the Uptime Institute didn’t ask whether the companies being surveyed used proprietary technologies or open technologies with appropriate licensing. This matters for two reasons:

  • When a company uses appropriately licensed open technologies, it is not at the mercy of the vendor to patch them, to prevent incidents, or to fix incidents when they come up. Granted, the company may *choose* to rely on the vendor for all of that, but if it’s late at night and push comes to shove, a company that uses open tech can write and fire off a patch itself instead of waiting for the vendor to send a representative from another city or state. This means that you can eliminate a large share of the 50% to 60% of incidents attributed to vendors, at least the portion caused by a vendor bottleneck.
  • Open-source software is easier for the internal staff to understand. Its openness means it is more likely to be standard, which means it probably won’t be a one-off technology that only the vendor has been trained in. Also, the inner workings of the router/server/daemon/etc. will be visible to your internal IT staff, so they will be less likely to make errors when using the equipment and more likely to fix any errors they do make before those errors turn into outages.

From a technological standpoint, it’s a win-win. From a financial perspective, too. Think about it: if you do end up doing some of the vendor’s job in the middle of a crisis, so what? You’d be paying your sysadmin overtime during a crisis anyway, so you might as well pay them to be productive. You’ll save money by fixing the problem sooner rather than later, and you’ll have leverage at your next contract negotiation with your vendor.