Strategies To Avoid Catastrophic Downtime

Posted by Salim Virani on Aug 23, 2009 in Advice

Recently a friend’s entire business was offline for around 12 hours, and customers were without service for the first time since the business started, over 7 years ago. Even in this case, where the team is very talented and diligent, mistakes were made.

This is a learning opportunity for us all. The following strategies will help you make immediate improvements to the reliability of your IT infrastructure:

1. Always make the business decisions first.

If you’re finding your IT infrastructure tends to hold you back, rather than create opportunity, you’re likely making IT decisions that aren’t driven by your business strategy.

A common trap is thinking that maintaining software will be as cheap and easy as installing it. This trap sometimes disguises itself to leadership as, “That’s an IT issue – I’ll let them decide how to handle that,” or to IT as, “I can just install that quickly.”

In both cases, the IT decision doesn’t get aligned to the business strategy, and it bypasses the risk/reward process that all other business decisions go through. Not only is this needlessly dangerous, it sets your IT investments down a path that isn’t in line with your overall business.

2. Don’t host anything you don’t have to.

It may seem trivial to host your own DNS, email, web servers or another common service, but you’ll pay dearly when it’s a trivial problem that takes your service down.

As you have to maintain more software, you spread your resources thin and expose yourself to catastrophe from unpatched security flaws and version incompatibilities. When it comes to reacting in an emergency, managing all of these bits and pieces is going to exacerbate the problem and slow your response.

In the long run, it’s cheaper to use a company with dedicated experts, leaving your experts to focus on your core business. This applies to tech businesses too: if it’s not in your core offering, outsource it.

3. Don’t punish your users for your failures.

Degrade gracefully. Avoid unnecessary dependencies between systems.

In many cases, a single failure causes other functional systems to stop, causing more inconvenience than necessary. If there’s a catastrophic failure in one area, plan on containing it. Consider that minimising your users’ inconvenience may be more important than other factors (such as billing).

For example, if you run a metered service with low marginal costs, billing or authentication failures should default to a minimal access level rather than no access at all. The loss of the unbilled revenue is likely much smaller than the loss in revenue from customer churn.
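As a minimal sketch of that default, assuming a hypothetical check_billing callable that raises BillingUnavailable when the billing system itself is down (none of these names come from a real system), the access decision might look like this:

```python
from enum import Enum


class Access(Enum):
    NONE = 0
    MINIMAL = 1
    FULL = 2


class BillingUnavailable(Exception):
    """Raised when the (hypothetical) billing system cannot be reached."""


def resolve_access(account_id, check_billing):
    """Return the access level for an account.

    check_billing is an assumed callable: it returns True if the account
    is in good standing, False if it is not, and raises BillingUnavailable
    when the billing system itself has failed.
    """
    try:
        in_good_standing = check_billing(account_id)
    except BillingUnavailable:
        # Our failure, not the customer's: degrade to minimal access
        # rather than locking them out.
        return Access.MINIMAL
    return Access.FULL if in_good_standing else Access.NONE
```

The point of the design is that "billing said no" and "billing did not answer" are treated as different outcomes. Only the first one denies access.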

4. Communicate.

Don’t worsen your downtime by going dark on your communications as well. Have a plan to get the message out quickly and regularly. Letting your customers know ASAP helps them minimise the impact on their own operations. Regular updates, whether you’re on track or not, help your users plan around the outage and maintain their confidence in you.

5. Monitor.

External monitoring services are so cheap, there’s no excuse not to use them. Management and customer services should be on the list for automatic notification. This frees up your techies to deal with the problem first, and the entire business can react immediately.
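An external monitoring service would normally handle this for you. The sketch below only illustrates the idea of a notification list that reaches beyond the tech team. The is_up and send callables and the recipient names are assumptions for illustration, not part of any particular monitoring product:

```python
def notify_on_failure(service, is_up, recipients, send):
    """Alert every recipient when a service check fails.

    is_up(service) is an assumed callable returning True when the service
    responds; send(recipient, message) is an assumed callable that delivers
    one alert (email, SMS, and so on).
    Returns the list of recipients that were alerted.
    """
    if is_up(service):
        return []
    message = f"{service} is DOWN"
    for recipient in recipients:
        send(recipient, message)
    return list(recipients)


# Hypothetical list: techies are notified, but so are management and
# customer services, so nobody has to stop fixing the problem to pass
# the news along.
RECIPIENTS = ["oncall-tech", "management", "customer-services"]
```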

6. Be redundant, but not complicated.

In many cases, duplication is grossly overcomplicated and unnecessary. Redundancy needs to be simple.

Most businesses don’t have the time to regularly test their backup or failover infrastructure, so simplicity helps make sure it still works when you need it. The IT world is full of horror stories of fancy RAID arrays and automated backup/restore plans gone wrong.

Consider what critical functions need redundancy, and within those, consider if reduced functionality will suffice. If so, you have a much simpler failover system to build and maintain.

As an example, consider REST with queuing. It may require some investment, but it has many additional benefits. Your services become far more scalable, more maintainable, and more flexible from a strategic point of view. Matt Biddulph, of Dopplr, has a great presentation on this.
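To make the queuing idea concrete, here is a rough sketch that is not taken from the presentation mentioned above. It assumes the front end only records each request and answers straight away, while a separate worker does the real processing later. The in-memory queue stands in for whatever durable queue you would use in practice, and the function names are hypothetical:

```python
import queue

# Stand-in for a durable message queue shared by front end and workers.
jobs = queue.Queue()


def accept_request(payload):
    """Front end: record the work and answer immediately."""
    jobs.put(payload)
    return 202  # HTTP "Accepted": received, but not yet processed


def drain(process):
    """Worker: process everything currently queued.

    process is an assumed callable that handles one payload.
    Returns the number of payloads handled.
    """
    handled = 0
    while True:
        try:
            payload = jobs.get_nowait()
        except queue.Empty:
            return handled
        process(payload)
        handled += 1
```

If the worker is down, accept_request keeps succeeding and the work simply waits in the queue, which is the reduced-functionality failover described above.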

7. Be watchful AFTER a migration.

Migrations are a time when you are most exposed to human error and planning failures. Plan to have extra people watching and ready during and AFTER the migration.

Migrations tend to cause glitches that are tough to spot right away. Sometimes, caching or permissions cause serious glitches to go undetected immediately after a migration. Your tests will pass, but the problems will crop up later. So watch carefully afterwards. If you have any processes that run at a regular interval (say monthly), then be vigilant until you’ve run through at least one complete cycle.

8. Make sure you have redundant humans.

Many small businesses are over-reliant on a single tech person. This might be a necessary risk when you’re bootstrapping, but it’s a risk that must be managed.

Don’t fool yourself into thinking you need another techy of equal skill, time or knowledge. Critical emergency functions rarely require detailed knowledge of the entire system. In many cases, a server admin just needs to know how to build a few servers and install your software. That’s enough to have a second pair of hands at the ready, just in case.

Break your emergency plans into small chunks, and document them in phases. You’ll find the most likely causes of failure are easy to document and quick for someone else to learn.

For server administration, I’ve used Green Olive Tree for years. They charge $100 per server per month, which includes pro-active 24/7 monitoring and an expert pair of hands constantly at the ready.

9. Make sure you have redundant suppliers.

Organisational failure is becoming more likely, and sometimes the signs aren’t clear.  Many businesses are stretching their people further, and that can lead to unforeseen problems if you’re a customer. ISPs may have bandwidth cut off, or reduce their tech support response times without notice. Don’t put all of your infrastructure in the hands of a single organisation.

10. Make sure you have redundant bandwidth and routing.

Monitoring and DNS failover are very cheap and practical. If you consider your absolute minimal requirements, you can usually configure a server or two (or even a shared hosting package) at a different ISP that uses a different backbone, and have the monitoring system fail over to it automatically.
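The decision behind an automatic failover can be very small. The sketch below assumes the monitoring system keeps a list of recent check results for the primary server and only switches after several consecutive failures, so that one dropped check doesn’t trigger a failover. The threshold and the addresses (taken from the ranges reserved for documentation) are illustrative assumptions, and actually updating the DNS record is left to your DNS provider:

```python
def choose_target(recent_checks, primary, backup, threshold=3):
    """Pick which address DNS should point at.

    recent_checks is a list of booleans, oldest first, where True means
    the primary responded to that check. The backup is chosen only when
    the last `threshold` checks all failed.
    """
    last = recent_checks[-threshold:]
    if len(last) == threshold and not any(last):
        return backup
    return primary


PRIMARY = "192.0.2.10"     # example address at your main ISP
BACKUP = "198.51.100.10"   # example address at a different ISP

# Three failures in a row: point DNS at the backup.
target = choose_target([True, False, False, False], PRIMARY, BACKUP)
```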

As we move more towards a service-oriented marketplace, SLAs have become more than a legal and architectural consideration. The marketplace is becoming more crowded, so downtime will have a greater and greater effect on your credibility and marketability. That means the impact of potential failure scenarios is rising.

At the same time, the cost of addressing these risks is falling. Recent advances in technologies such as cloud computing, REST and monitoring now allow you to address these scenarios at a reasonable cost.

If you take a look at your services with fresh eyes, you’ll see many new opportunities to expand more quickly and avoid disasters.
