In the news today BlackBerry said "The messaging and browsing delays... in Europe, the Middle East, Africa, India, Brazil, Chile and Argentina were caused by a core switch failure within RIM's infrastructure" (Source: http://www.bbc.co.uk/news/technology-15243892). They also said:
"Although the system is designed to failover to a back-up switch, the failover did not function as previously tested," (Source: http://news.cnet.com/8301-30686_3-20118882-266/international-blackberry-outage-continues/)
This immediately causes me to ask a few questions:
- Why wasn't the failover triggered manually?
- What was missed in the testing of the switch's failover?
- Was this an existing issue?
- When was the DR plan last tested?
- Had changes been made which invalidated the DR plan?
You also have to be absolutely aware of any changes that could affect your DR plans. This means every change has to be screened to ensure that you aren't creating a single point of failure (SPOF) or, if you are, that everyone is aware of it and plans are put forward to plug that gap.
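As a rough illustration of what screening a change for SPOFs could look like, here is a minimal Python sketch. It assumes you keep an inventory mapping each role in your infrastructure to the components that can serve it. The role names, component names and both functions are hypothetical and have nothing to do with RIM's actual setup.

```python
from typing import Dict, List, Set


def find_spofs(roles: Dict[str, Set[str]]) -> List[str]:
    """Return the roles served by fewer than two components, sorted by name."""
    return sorted(role for role, providers in roles.items() if len(providers) < 2)


def screen_change(roles: Dict[str, Set[str]], removed: Set[str]) -> List[str]:
    """Return the roles that would newly become SPOFs if `removed` were taken out."""
    before = set(find_spofs(roles))
    after_roles = {role: providers - removed for role, providers in roles.items()}
    after = set(find_spofs(after_roles))
    # Only report gaps the change itself creates; existing ones are already known.
    return sorted(after - before)


# Hypothetical inventory: role -> components able to serve that role.
topology = {
    "core-switch": {"switch-a", "switch-b"},
    "message-relay": {"relay-1", "relay-2", "relay-3"},
    "database": {"db-primary"},
}

known_gaps = find_spofs(topology)                 # gaps that exist today
new_gaps = screen_change(topology, {"switch-b"})  # gaps this change would add
print("Known SPOFs:", known_gaps)
print("New SPOFs if switch-b is removed:", new_gaps)
```

A real check would have to model network paths, power, sites and people as well, but even a crude list like this forces the question to be asked before the change goes in.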
The key in any major outage is to get the system back up, even if it means failing over manually. However, any steps taken to recover the service should be noted in an emergency change request of some description. Once the systems have been recovered, it is vital that the change notice is thoroughly reviewed to find out both what went wrong and what could still go wrong. Recovering from an outage is one thing, but it's all for naught if that recovery leads to a potential problem which will bite you later on.
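To make that concrete, the sketch below shows one way the "try automatic, fall back to manual, write everything down" flow could be expressed in Python. It is purely illustrative: the ChangeLog class, the recover function and the two failover callables are made-up names standing in for whatever your own tooling provides, and a real emergency change record would also carry timestamps and the names of the people involved.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChangeLog:
    """Hypothetical stand-in for an emergency change request record."""
    entries: List[str] = field(default_factory=list)

    def record(self, step: str) -> None:
        self.entries.append(step)


def recover(auto_failover: Callable[[], bool],
            manual_failover: Callable[[], bool],
            log: ChangeLog) -> bool:
    """Try automatic failover, then manual, recording every step taken."""
    log.record("automatic failover attempted")
    if auto_failover():
        log.record("automatic failover succeeded")
        return True
    log.record("automatic failover failed; starting manual failover")
    if manual_failover():
        log.record("manual failover succeeded")
        return True
    log.record("manual failover failed; escalate")
    return False


# Simulated outage: the automatic failover does not work, the manual one does.
log = ChangeLog()
restored = recover(lambda: False, lambda: True, log)
for entry in log.entries:
    print(entry)
```

The point of the log is the review afterwards: the list of steps is what you go through once the service is back, to see what went wrong and what the recovery itself may have left behind.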
ITIL processes teach a lot of this, and implementing these practices can be a pain, but it's a choice. You either suffer the pain of the paperwork or the pain of the outage.
At least if potential problems are known about, they can be more easily dealt with when they appear and bite you, and they will appear.
The mobile industry is very much a cut-throat industry and this dual outage will do BlackBerry no good at all, because others will seize upon it as a sign of BlackBerry's weak infrastructure, and they will be right.
To recover from this BlackBerry need to do a thorough review of their systems and DR processes and ensure that if this happens again they have the ability to recover from it very rapidly. They are, after all, reliant on their userbase for their income and they have failed a major test.
2 comments:
RIM's major selling point for the mass BlackBerry market is the BIS connection and the BBM service (this is what's making BlackBerries very popular with teenagers lately - text messaging on steroids). These outages don't really help them shift new £500 handsets to customers on the premise of a wonky system.
I'm surprised that a production environment system wouldn't have a manual "OMG IT'S ALL GONE TITS UP" switch to flick the service into a failsafe, less-features-but-messages-still-get-there mode. People wouldn't really mind so much not being able to transmit files over BBM for a day or two, for instance, but the entire messaging capacity failing for the same time causes serious problems for those who rely on it as a method of communication with not just friends, but business contacts. This lack of failure tolerance has the real capacity to cost a lot of people a fair bit of money; money which should rightfully come out of RIM's pockets, since their "robust, reliable" BIS service is the one that's let them down.
Manual failover isn't ideal, but it's better than no failover at all.
This reminds me of a major outage by our second biggest Telco.
The whole state of Queensland was offline for a day. This included phones, mobiles and internet. They didn't specifically have a single point of failure.
The main fibre across the border was cut by road works. All traffic was routed to the secondary link. Unfortunately this went down due to a maintenance fault, which brought the whole network down.
It took them 8 hours to fix because both incidents happened quite a distance from the technicians.
My father is basically the technical support person in our biggest telco's DR. Although most of the time he is only dealing with 'small' problems, he has had to deal with the flooding at the start of this year, which was a big effort from all.