Readers may be wondering how far out in the backwaters a Malaysian author must
be to still be writing about Disaster Recovery when we're about to hit 2013.
Well, I figured that with the end-of-the-world hype, the floods and political
events in Thailand, and the Japanese earthquake and tsunami, the topic may just
be in vogue again. Also, I've been working on a Disaster Recovery improvement
solution and felt that a number of customers are still not sufficiently aware
of what's best for their organization.
What I really want to expound on is the lack of emphasis on the practical side
of acting on a Business Impact Analysis (BIA) result. I am assuming that you
are aware of what a BIA is and have had a professionally trained team walk you
through the exercise, in which the CIO and business owners identify the IT
services, applications and infrastructure that support critical business
functions. Note: the IT piece of the puzzle is NOT critical unless the business
part is.
The desired outcome is the reduction and mitigation of business impact, failing
which a contingency is in place. Like any good IT outsourcer, the vendor has to
advise on what's best for the business, not on how much money it can make by
overselling hot-site synchronous data replication as the cure for all IT
ailments.
The figure below illustrates “anecdotal” evidence of risk probability versus
event categories. (Once readers pony up their SLA data to yours truly and I've
received at least 120 data points, I'll be more than happy to provide an
accurate statistical sampling of the probabilities.) You will see that the
largest share of business disruptions comes not from an actual disaster but
from Type A risks: failures due to operational weaknesses.
Why boil it down to
operational weaknesses?
Simply put, if you stress test and quality-assure your applications, you reduce
the number of catastrophic bugs that add extra zeroes in your accounting
software and put you at risk of customer lawsuits. The same goes for incidents
where an overworked operator ignores a bunch of RAID drives with parity errors
that are on the verge of going on the blink under the 24/7 banking application.
I personally went through an incident in a previous incarnation that led to a
48-hour email outage. It was caused by an operational change, and no Disaster
Recovery solution could have salvaged it.
Type B risks are much easier to resolve because the fix is the elimination of
single points of failure. Take, for example, a transaction server that runs
1,000 transactions a second, each committing income to the company, but that
unfortunately resides on a RAID-0 setup with only a single core switch and a
single server instance.
Although rare in this day and age, this may still turn up if you look deep
enough into how your IT services value chain is set up.
The figure above is highly simplified, but when you lay out the entire IT
ecosystem, any one of those single points of failure can lead to a money-losing
outage of a critical business function. It gets worse when there are no spare
parts on hand and the last backup tape has never been tested for restoration.
Suffice it to say, the organization wouldn't have Type B risks if it didn't
have an IT outsourcer exhibiting Type A behaviour.
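For readers who prefer to see the idea in code, here is a minimal sketch of a single-point-of-failure check, written in Python. The `Component` class, its fields and the `single_points_of_failure` function are my own hypothetical names, and the "power feed" entry is an assumed extra added only to show a component that passes. The rest mirrors the transaction server example above.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    instances: int          # independent units that can each carry the load
    redundant: bool = True  # False for layouts with no fault tolerance, e.g. RAID-0


def single_points_of_failure(chain):
    """Return the names of components whose loss would stop the whole chain."""
    return [c.name for c in chain if c.instances < 2 or not c.redundant]


# Hypothetical inventory based on the example in the text.
chain = [
    Component("transaction server", instances=1),
    Component("core switch", instances=1),
    # Striping across several disks adds no protection: one dead disk loses the array.
    Component("storage (RAID-0)", instances=4, redundant=False),
    Component("power feed", instances=2),  # assumed, not from the example
]

spofs = single_points_of_failure(chain)
print(spofs)
```

A real inventory would of course come from your configuration records rather than a hand-typed list, but even a crude pass like this tends to start the right conversations.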
Type C risks are arguable, because organizations with more direct links to
activities deemed “sensitive or sensationalist” may suffer from external or
even internal attacks. Strangely, Malaysian and government-linked sites tend to
suffer a spike in hacking and defacement attacks during international football
matches! (You will need to speak to the good folks at CyberSecurity Malaysia
for the details.)
Finally, Type D risks: is an actual DR solution to mitigate a full-blown
disaster necessary? YES. But don't do it because it is mandated by the central
bank; do it because you've done your homework. Before you spend money on the DR
solution, you may wish to allocate your hard-earned IT budget to mitigating the
biggest risks to business operations. More importantly, find an honest IT
outsourcer that has your interests at heart to work on the solution with you.
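One rough way to do that homework is to rank the four risk types by expected yearly loss, that is, how often an event happens multiplied by what each event costs. The Python sketch below is only an illustration: the function name is hypothetical, and every frequency and cost figure is a made-up placeholder to be replaced with your own BIA and SLA data.

```python
def rank_by_expected_loss(risks):
    """Sort (name, events_per_year, cost_per_event) by expected yearly loss, highest first."""
    scored = [(name, events * cost) for name, events, cost in risks]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Placeholder figures only, NOT measurements. Substitute your own data.
risks = [
    ("Type A: operational weakness", 4.0, 50_000),
    ("Type B: single point of failure", 1.0, 80_000),
    ("Type C: attack", 0.5, 60_000),
    ("Type D: disaster", 0.02, 2_000_000),
]

ranking = rank_by_expected_loss(risks)
for name, loss in ranking:
    print(f"{name}: about {loss:,.0f} per year")
```

With these placeholder numbers the operational risks come out on top, but the point of the exercise is the ranking method, not the numbers.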
To summarize, mitigate Type A risks with a top-notch IT outsourcing team and
adequate process adherence. Mitigate Type B risks by eliminating single points
of failure and Type C risks with security hardening. Lastly, address Type D
risks once you've done what you can with all of the above.