In the news today BlackBerry said "The messaging and browsing delays... in Europe, the Middle East, Africa, India, Brazil, Chile and Argentina were caused by a core switch failure within RIM's infrastructure" (Source: http://www.bbc.co.uk/news/technology-15243892). They also said:
"Although the system is designed to failover to a back-up switch, the failover did not function as previously tested," (Source: http://news.cnet.com/8301-30686_3-20118882-266/international-blackberry-outage-continues/)
This immediately causes me to ask a few questions:
- Why wasn't the failover triggered manually?
- What was missed in the testing of the switch failover?
- Was this an existing issue?
- When was the DR plan last tested?
- Had changes been made which invalidated the DR plan?
You also have to be fully aware of any changes that could affect your DR plans. That means every change has to be screened to ensure that you aren't creating a single point of failure (SPOF) or, if you are, that everyone is aware of it and plans are put forward to plug that gap.
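As a rough illustration of what screening a change for SPOFs could look like, here is a minimal Python sketch. The topology, the role and component names, and the functions `find_spofs` and `screen_change` are all made up for this example; a real environment would pull this from a CMDB or similar inventory, and "fewer than two components per role" is only one simple way to define an SPOF.

```python
from typing import Dict, List


def find_spofs(topology: Dict[str, List[str]]) -> List[str]:
    """Return the roles that are served by fewer than two distinct components."""
    return sorted(role for role, nodes in topology.items() if len(set(nodes)) < 2)


def screen_change(topology: Dict[str, List[str]], role: str, remove: str) -> List[str]:
    """Return the SPOFs that removing one component from a role would newly create."""
    before = set(find_spofs(topology))
    # Work on a copy so the live topology is never modified by the screening.
    proposed = {r: list(nodes) for r, nodes in topology.items()}
    proposed[role] = [n for n in proposed.get(role, []) if n != remove]
    after = set(find_spofs(proposed))
    return sorted(after - before)


# Hypothetical environment: role -> components that can serve that role.
topology = {
    "core-switch": ["switch-a", "switch-b"],
    "database": ["db-1", "db-2"],
    "mail-relay": ["relay-1"],
}

known_spofs = find_spofs(topology)
new_spofs = screen_change(topology, "core-switch", "switch-b")
print(known_spofs)  # ['mail-relay']
print(new_spofs)    # ['core-switch']
```

The first result is the gap everyone should already know about; the second is the gap the proposed change would open, which is the point at which the change should be challenged or a plan agreed to plug it.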
The key in any major outage is to get the system back up, even if it means failing over manually. However, any steps taken to recover the service should be noted in an emergency change request of some description. Once the systems have been recovered, it is vital that the change notice is thoroughly reviewed to find out both what went wrong and what could still go wrong. Recovering from an outage is one thing, but it's all for naught if that recovery leaves behind a potential problem which will bite you later on.
ITIL processes teach a lot of this, and implementing these practices can be a pain, but it's a choice. You either suffer the pain of the paperwork or the pain of the outage.
At least if potential problems are known about, they can be dealt with more easily when they appear and bite you. And they will appear.
The mobile industry is very much a cut-throat industry, and this dual outage will do BlackBerry no good at all because others will seize upon it as a sign of BlackBerry's weak infrastructure, and they will be right.
To recover from this, BlackBerry need to do a thorough review of their systems and DR processes and ensure that if this happens again they have the ability to recover from it very rapidly. They are, after all, reliant on their userbase for their income and they have failed a major test.