Reviewing Vodafone's IT failure
A few days ago, Vodafone published a statement in response to Ofcom's investigation into its billing and IT failures. A few passages in it are worth picking apart.
"The matters under investigation were a consequence of errors during a complex IT migration which involved moving more than 28.5 million customer accounts and almost one billion individual customer data fields from seven legacy billing and services platforms to one, state-of-the-art system."
This paragraph is probably one of the most important in the whole document: it explains what they were attempting to do. However, it includes one key phrase that always indicates a project is doomed to failure. It states that the project is a complex one.
No IT project should be complex. If it is, then you've not broken the work into small enough chunks.
Clearly, this project had its challenges, but moving data from one system to another is NOT complex. The source data absolutely needs to be understood, and the fact that there are seven source systems means there will be challenges around merging that data into a central system. No doubt each of those seven systems has a different database schema, and the same customer records may well be held in several of them, potentially with slight variations in how that data is stored.
All of these issues add challenges, but none of them is particularly complex. Clearly, DBAs who understand each schema need to be involved, and they need to be able to call a halt if they feel that something isn't right.
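To make the "slight variations" point concrete, here is a minimal sketch, in Python, of the sort of normalisation and matching needed before records from different legacy schemas can be merged. The field names, formats and matching rule are all assumptions for illustration, not Vodafone's actual data model.

```python
# Hypothetical example: reduce legacy customer records to a comparable
# canonical form so duplicates across source systems can be matched
# before they are merged into the new central platform.
import re

def normalise(record: dict) -> dict:
    """Canonicalise the fields we intend to match on (illustrative fields only)."""
    return {
        "name": re.sub(r"\s+", " ", record.get("name", "")).strip().lower(),
        "postcode": record.get("postcode", "").replace(" ", "").upper(),
        "msisdn": re.sub(r"\D", "", record.get("msisdn", ""))[-10:],  # last 10 digits
    }

def same_customer(a: dict, b: dict) -> bool:
    """Assume two records are the same customer if phone number and postcode agree."""
    na, nb = normalise(a), normalise(b)
    return na["msisdn"] == nb["msisdn"] and na["postcode"] == nb["postcode"]

# The same person, held slightly differently in two legacy systems.
legacy_a = {"name": "Jane  Smith", "postcode": "ab1 2cd", "msisdn": "+44 7700 900123"}
legacy_b = {"name": "SMITH, Jane", "postcode": "AB12CD", "msisdn": "07700900123"}
assert same_customer(legacy_a, legacy_b)
```

In practice the matching rules would come from the DBAs who know each schema, which is exactly why they need to be in the room.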
"Despite multiple controls in place to reduce the risk of errors, at various points a small proportion of individual customer accounts were incorrectly migrated, leading to mistakes in the customer billing data and price plan records stored on the new system. Those errors led to a range of different problems for the customers affected which – in turn – led to a sharp increase in the volume of customer complaints."
Controls aren't going to help. This is the sort of project that needed to be done two or three times. By that, I mean that you do the data migration once and write scripts to pinpoint specific errors: the errors that the DBAs and front-line staff have said they expect to see, based on their experience of customer issues and of the database structure.
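This is what I mean by such a script, sketched in Python. The account fields and the specific checks (wrong price plan, changed monthly charge, lost discounts) are invented for illustration; the real list would come from the DBAs and front-line staff.

```python
# Replay each account from the legacy extract against the new system and
# report the specific mismatches that staff said to expect.
def check_account(legacy: dict, migrated: dict) -> list[str]:
    problems = []
    if legacy["price_plan"] != migrated.get("price_plan"):
        problems.append(f"price plan changed: {legacy['price_plan']} -> {migrated.get('price_plan')}")
    if legacy["monthly_charge"] != migrated.get("monthly_charge"):
        problems.append("monthly charge differs")
    if legacy.get("discounts") and not migrated.get("discounts"):
        problems.append("discounts lost in migration")
    return problems

def reconcile(legacy_accounts: dict, migrated_accounts: dict) -> dict:
    """Return {account_id: [problems]} for every account that doesn't match."""
    report = {}
    for account_id, legacy in legacy_accounts.items():
        migrated = migrated_accounts.get(account_id)
        if migrated is None:
            report[account_id] = ["account missing from new system"]
        elif problems := check_account(legacy, migrated):
            report[account_id] = problems
    return report
```

The point isn't the code itself, it's that the expected failure modes are written down and checked automatically after every migration run.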
Once that's done, and once you've confirmed that these tools work and actually find problems, you then migrate your own staff and a select number of customers who have been contacted in advance and offered some level of compensation to act as guinea pigs for this work. Say, 5,000 accounts.
Each one of those accounts would carry a flag so that, if and when they call in with problems, they get put through to someone who knows they've been migrated.
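The flag itself is trivial; here's a rough Python sketch of the idea, with made-up data structures and queue names just to show the shape of it.

```python
# Mark the staff and volunteer pilot accounts, then let call routing send
# anyone carrying the flag straight to agents who know about the migration.
PILOT_FLAG = "migration_pilot"

def flag_pilot_accounts(accounts: dict, pilot_ids: set) -> None:
    for account_id in pilot_ids:
        accounts[account_id]["flags"] = accounts[account_id].get("flags", set()) | {PILOT_FLAG}

def route_call(account: dict) -> str:
    """Pick a support queue; pilot accounts go to migration-aware agents."""
    if PILOT_FLAG in account.get("flags", set()):
        return "migration-support-team"
    return "general-support"
```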
You are NOT going to totally eliminate problems. There will always be the weird bug that no one foresaw, because a user took out a contract at 6pm on Friday the 13th and hit the backspace key three times, or something equally bizarre. But by having staff and volunteer customers serve as guinea pigs, and by having already run the migration once, you should be able to eliminate 95% of the errors. You can also show the regulatory authorities that you've done every single thing you can both to allow the project to be called to a halt should someone feel that's required, and to catch errors. These two safeguards are vital; otherwise a project will just continue to lurch from hidden disaster to hidden disaster whilst the execs sit fat, dumb and happy because the status reports all say "All green!".
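One way to make "call a halt" an explicit decision rather than something buried in a status report is to gate each migration batch on the error rate found by the reconciliation run. The sketch below is my own illustration, with an invented threshold, not anything from Vodafone's process.

```python
# Gate each migration batch: proceed only if the reconciliation error rate
# stays under an agreed limit; otherwise the project pauses by default.
def migration_gate(total_accounts: int, accounts_with_problems: int,
                   max_error_rate: float = 0.001) -> bool:
    """Return True if the batch may proceed, False if the migration should pause."""
    if total_accounts == 0:
        return False
    error_rate = accounts_with_problems / total_accounts
    print(f"batch error rate: {error_rate:.4%} (limit {max_error_rate:.4%})")
    return error_rate <= max_error_rate

# Example: 12 bad accounts out of the 5,000 pilot is 0.24%, above a 0.1% limit, so stop.
assert migration_gate(5000, 12) is False
```

With something like this in place, "All green!" has to be earned rather than asserted.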
"Once the issue was finally escalated to senior management there was a prompt, full and thorough investigation and every effort was made to fix the underlying failure and to refund in full all affected customers as quickly as possible."
"Finally escalated to senior management". To me, this just reinforces the vision that senior management either weren't interested or were putting pressure onto the project to "just get it done". Often, this pressure isn't deliberate but subtle. Comments like "Every day the project is delayed costs us!" are used to all to often exert influence and in turn, management don't get a truthful picture of what is going on.
"The IT failure involved was resolved by April 2015 – approximately 11 weeks after senior managers were finally alerted to it – with a system-wide change implemented in October 2015 that – as Ofcom acknowledges – means this error cannot be repeated in future."
"More broadly, we have conducted a full internal review of this failure and, as a result, have overhauled our management control and escalation procedures. A failure of this kind, while rare, should have been identified and flagged to senior management for urgent resolution much earlier."
I would love to see what this internal review uncovered. I did ask Vodafone about it, but they refused to share it as it's an internal review. I can have a guess, though, and I suspect that it's a lot of what I've already mentioned here. I also suspect that most of the issues fall into the same list of project sins that Tony Collins put together in his book "Crash: 10 easy ways to avoid a computer disaster". Even though this book was first published in 1997, 19 years on the same mistakes get made time after time, after time. It seems those who ignore history are truly condemned to repeat it.