Reviewing the British Airways IT failure
It has probably not escaped anyone's attention that British Airways have had a major IT failure today. The big question is why that failure is now at 10 hours in and still counting.
British Airways have been tight-lipped over the whole thing, saying nothing more than that they are experiencing a "major systems failure".
Normally, a failure such as this would be declared a major incident an hour or two in, and a manual failover to a second DC would be initiated. The fact that it's now at the 10 hour mark and still counting means that either the automated and manual failovers didn't work, or that, to save money, BA have either reduced testing in the failover DC or removed it altogether. Whatever the reason, it doesn't look good for them.
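For anyone unfamiliar with how that automated path usually works, here is a minimal sketch of a health-check-driven failover trigger. Everything in it is hypothetical: the health endpoint URL, the thresholds and the `promote_secondary()` step are illustrative placeholders, not BA's actual tooling.

```python
#!/usr/bin/env python3
"""Minimal sketch of a health-check-driven DC failover trigger (hypothetical)."""
import time
import urllib.error
import urllib.request

PRIMARY_HEALTH_URL = "https://primary-dc.example.com/healthz"  # hypothetical endpoint
CHECK_INTERVAL_S = 30        # how often to poll the primary DC
FAILURE_THRESHOLD = 4        # consecutive failures before declaring it down


def primary_is_healthy() -> bool:
    """Return True if the primary DC answers its health check in time."""
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


def promote_secondary() -> None:
    """Placeholder for the real failover action, e.g. repointing GSLB/DNS,
    promoting standby databases and bringing up the secondary DC's app tier."""
    print("FAILOVER: promoting secondary data centre")


def main() -> None:
    failures = 0
    while True:
        if primary_is_healthy():
            failures = 0
        else:
            failures += 1
            print(f"health check failed ({failures}/{FAILURE_THRESHOLD})")
            if failures >= FAILURE_THRESHOLD:
                promote_secondary()
                break
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    main()
```

The point of the sketch is simply that the decision to fail over is meant to be made in minutes, not hours, and that the whole chain only works if the secondary DC is actually kept in a state where `promote_secondary()` has something healthy to promote.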
The BBC have an article up discussing this and I'd like to go through the points and add my own take:
So, 10 hours in and they still only "believe" it's a power supply issue? That is a very odd thing to say, as a power supply issue would be pretty obvious, and it's also not something that should cause such a major outage. Because of the way three-phase power works, it should affect, at most, a third of the racks. Even if the power outage takes down a major piece of equipment such as a central SAN, switch or firewall, there should be redundant systems in other locations in the same DC and in other DCs.
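To make that "a third of the racks" arithmetic concrete, here is a toy illustration. It assumes the common pattern of striping single-fed racks evenly across phases A, B and C; real DC power layouts vary, and dual-fed racks wouldn't lose power at all.

```python
# Toy illustration of why a single-phase fault should only hit ~1/3 of racks,
# assuming racks are striped round-robin across phases A, B and C (single-fed).
PHASES = ("A", "B", "C")


def phase_for_rack(rack_number: int) -> str:
    """Assign rack N to a phase round-robin: 1->A, 2->B, 3->C, 4->A, ..."""
    return PHASES[(rack_number - 1) % 3]


racks = range(1, 91)              # e.g. a 90-rack hall
failed_phase = "B"                # suppose phase B drops out
affected = [r for r in racks if phase_for_rack(r) == failed_phase]

print(f"{len(affected)} of {len(racks)} racks lose their feed "
      f"({len(affected) / len(racks):.0%})")   # -> 30 of 90 racks (33%)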
In fact, BA should have the ability to fail over to a remote data center. The fact that they haven't raises its own set of questions.
Outsourcing isn't always a bad thing; it can have a lot of positives, and there is no guarantee that this issue couldn't have happened had BA not outsourced. If BA are tightening the purse strings then it's quite possible that this issue could still have occurred regardless; it all depends on the ability to fail over, and on how often that failover has been tested.
BA saying that they'd never compromise the integrity or security of their systems is an untruth, though. They've already done that, and a 10-hour-plus outage proves that, at some level, there has been a compromise on the integrity of the systems; otherwise this problem would have been a minor footnote and not a front-page story.
Another thing to note is that a failure like this has wider impacts beyond the failure itself and the cost of the downtime. There is the cost of a loss of faith in the brand, and I have to wonder just what that will cost BA. Too many companies compromise on failover and recovery capabilities to save money, only for it to bite them hard when they cannot easily recover from an issue such as this.
At this point, I have to call bullshit on this being a power failure. A DC with properly designed power just doesn't have those points of vulnerability, and even if it did, what has happened to the UPS and generator systems that all decent DCs host?
To me, this feels more like another NHS-style ransomware incident. Hopefully more information will come to light over the coming days.