Tuesday, April 25, 2017

Atlassian accidentally DDOS'ed their own password change service.

As a company, if you're facing a suspected security breach, forcing your customers and clients to change their passwords is not a bad thing to do. However, when you do this, don't end up DDOS'ing your own system as Atlassian have apparently done.

If you don't know, Atlassian make the JIRA project management and tracking tool as well as the popular HipChat chat room software, and they've recently acquired Trello.

The sorry tale kicks off on Sunday:

The blog article is open, well written and talks about the password storage mechanism, which is passwords hashed with bcrypt and salted. All good there then.

The last line is an interesting one though: "If you are a user of HipChat.com and have not received an email from our Security Team with these instructions, we have found no evidence that you are affected by this incident." I'll come back to this.

The blog post does admit that customers may have lost data. That's something to be aware of with cloud-based chat services like HipChat.

Fair enough. I'm a HipChat and JIRA user and I got an email late on Sunday night, which I ignored. This morning around 6:30am I logged in to JIRA and was forced to change my password. Again, this is fine; it's a decent response to a security issue. But when I attempted to change the password, the ID server (id.atlassian.com) fell over. Thinking that this was going to be a minor issue, I waited an hour: same problem. No updates from Atlassian.

A few hours later Atlassian admitted that yes, there is a problem with the password changing process:

Some six hours later, Atlassian admitted that yes, they've DDOS'ed themselves:

This brings me back to the comment I mentioned earlier: "If you are a user of HipChat.com and have not received an email from our Security Team with these instructions, we have found no evidence that you are affected by this incident."

Are Atlassian seriously saying that a subset of users being forced to change passwords is enough to take down the system that changes passwords? If that's true, it's worrying. It's also worrying that they took six hours to spot that it's a capacity issue, and that they can't rapidly add capacity to the system. This leads to an important question: is the password change server THAT much of a custom build?

I do hope that they do a full root cause analysis of the security issue, how the data was extracted and the subsequent issues with the password change server.

Friday, April 14, 2017

VMware’s Photon and containers in VMware

I've been exploring VMware's Photon and running containers in VMware via Photon. In my experience, Photon works incredibly well and is faster than other distros in VMware. I've written more about it here.

Tuesday, March 21, 2017

Avoiding a GitLab-style outage with Blue/Green deployments

As most of you are probably aware, GitLab is, in part, a source code hosting service, and it suffered something of a major outage just a few weeks ago. Unusually, they posted a very full and frank report on what actually happened.

You can read more of this here.

Thursday, February 09, 2017

Fighting Azure AD Connect's custom installer

I’ve recently been spending more and more time looking into various cloud technologies such as AWS and Azure. One of the projects I’ve been working on required the on-premises Active Directory to be extended to Azure to allow for a future introduction of various Office 365 elements.
The process for doing this is fairly easy as it’s just a matter of installing the Azure Active Directory Connect tool onto a server, creating the domain in the Azure portal and then waiting for Azure AD Connect to sync.

For most installs, the bundled SQL Express will do the job. For others you'll want to use an existing SQL Server, but this is not as easy as it should be...

Read the full article on StarWind's blog page.

Tuesday, December 13, 2016

Do you really need a Hyper-V cluster?

Something I've come across a few times over the last couple of weeks is people asking for a Hyper-V cluster because management are demanding HA and 100% uptime.

The problem is that a Hyper-V cluster will only provide HA for the physical hosts that make up the cluster and not for the VMs that are running on the nodes in the cluster. If a VM crashes, becomes corrupted or has some other fault then it's down. All the host-level clustering in the world won't help that.

I'd even go further and say that a Hyper-V cluster harms the requirement to have HA, as a cluster needs to have the VMs hosted on shared storage. This then puts the company in the strange position of having HA on the hosts but not on the storage. When I've raised this, I've been told "The SAN won't break, it's a SAN".

Sorry, SANs can and will break. They are just bits of hardware that run bits of software. Normally they're very reliable bits of hardware with very reliable bits of software, but it's still software with bugs in. I've even had the situation where management are too scared to upgrade the SAN because it contains data that is just too important to lose. By choosing this route, you really are choosing to put all your eggs in one basket.

Yes, you've got HA at the hypervisor level, but you have nothing for the VMs and nothing for the storage. You're still down if the SAN has a bad day or if the VM crashes. You're really not in the place you want to be in, as you don't have HA at the guest level. So, how do we have HA for the guests?

Like anything, there are costs involved, and the less downtime you want, the more you have to spend. Today, a lot of applications have HA in some form available inside the software itself. For example, SQL Server has Always On availability groups, Exchange has Database Availability Groups, AD is built to be fault tolerant anyway and for file storage you've got DFS. All you need is to split these workloads between two hypervisors running on local storage and you have guest-level HA. Take a host out and you're still working.

What about other workloads? In this case there is nothing wrong with a cluster. Build a two-node guest cluster from VMs and use a witness share on a NAS somewhere so that you've got quorum, and you've then got HA for those applications as well. Again, as long as the workload is split across two or more physical hosts you're good to go. You've also got Hyper-V's ability to replicate a VM from one host to another which, as long as the bandwidth is there, gives you a nice DR option.
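
The "split across two or more physical hosts" rule is easy to check mechanically. Below is a minimal Python sketch, not anything that ships with Hyper-V: the workload and host names are made up for illustration, and the placement data is assumed to come from whatever inventory you already keep.

```python
def workloads_on_single_host(placement):
    """Return workloads whose replicas do not span at least two physical hosts.

    `placement` maps a workload name to the list of physical hosts that
    run one of its replicas (hypothetical names, for illustration only).
    """
    return sorted(
        name for name, hosts in placement.items() if len(set(hosts)) < 2
    )


# Hypothetical inventory: each application-level HA pair and where it runs.
placement = {
    "sql-availability-group": ["hv-host-1", "hv-host-2"],
    "exchange-dag": ["hv-host-1", "hv-host-2"],
    "file-server-dfs": ["hv-host-1", "hv-host-1"],  # both replicas on one host
}

at_risk = workloads_on_single_host(placement)
print(at_risk)
```

Anything the function returns would go down with a single host, no matter how much HA the application itself offers.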

Hyper-V also supports shared-nothing live migration. You don't even need a SAN to migrate a VM from one host to another. Add in Veeam for backups (following the 3-2-1 rule, of course) and you've got yourself a fairly decent setup.

In summary, as more applications gain some level of application-aware HA and as the list of features in Hyper-V and tools like Veeam grows, I don't see a need to ever cluster Hyper-V servers, as clustering them doesn't add the kind of features that we see with a VMware cluster. Maybe one day it will, but I cannot see it.

Monday, November 07, 2016

Reviewing Vodafone's IT failure

A few days ago, Vodafone responded to Ofcom's fine and attempted to explain why they failed:

"The matters under investigation were a consequence of errors during a complex IT migration which involved moving more than 28.5 million customer accounts and almost one billion individual customer data fields from seven legacy billing and services platforms to one, state-of-the-art system."

This paragraph is probably one of the most important in the whole document, as it explains what they were attempting to do. However, it does include one key phrase that always indicates that a project is doomed to failure: it states that the project is a complex one.

No IT project should be complex. If it is, then you've not broken the work into small enough chunks.

Clearly, this project had its challenges, but moving data from one system to another is NOT complex. The source data absolutely needs to be understood, and the fact that there are seven source systems clearly means that there will be challenges around merging this data into a central system. No doubt each of those seven systems will have a different database schema, and the same customer records could be held in several systems but with slight variations in how that data is stored.
All these issues add challenges but none of it is particularly complex. Clearly, DBAs who understand each schema need to be involved and they need to be able to call a halt if they feel that something isn't right.

"Despite multiple controls in place to reduce the risk of errors, at various points a small proportion of individual customer accounts were incorrectly migrated, leading to mistakes in the customer billing data and price plan records stored on the new system. Those errors led to a range of different problems for the customers affected which – in turn – led to a sharp increase in the volume of customer complaints."

Controls aren't going to help. This is the sort of project that needed to be done two or three times. By that, I mean that you do the data migration once and write scripts to pinpoint specific errors. These would be the errors that the DBAs and front end staff expect to see, based on their experience of customer issues and of the database structure.
Once that's done and you've confirmed that these tools work and find problems, you then migrate your own staff and a select number of customers who have been previously contacted and offered some level of compensation to act as guinea pigs for this work. Say, 5,000 accounts.
Each one of those accounts would have a flag so that if and when they call in with problems they get put through to someone who is aware that they've been migrated.
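
To give a feel for the sort of checking script I mean, here is a minimal Python sketch. It is not Vodafone's tooling: the `Account` fields, the account IDs and the queue names are all invented for illustration, and a real check would cover whatever fields the DBAs and front end staff say matter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    # Hypothetical fields; a real billing record will have far more.
    account_id: str
    price_plan: str
    monthly_charge_pence: int


def find_migration_errors(source, migrated):
    """Compare accounts before and after a trial migration.

    Both arguments map account_id -> Account. Returns a list of
    (account_id, problem) tuples describing every mismatch found.
    """
    problems = []
    for account_id, before in source.items():
        after = migrated.get(account_id)
        if after is None:
            problems.append((account_id, "missing after migration"))
            continue
        if after.price_plan != before.price_plan:
            problems.append((account_id, "price plan changed"))
        if after.monthly_charge_pence != before.monthly_charge_pence:
            problems.append((account_id, "monthly charge changed"))
    # Accounts that appear only in the target are suspicious too.
    for account_id in sorted(migrated.keys() - source.keys()):
        problems.append((account_id, "unexpected account in target"))
    return problems


def route_call(account_id, pilot_account_ids):
    """Send flagged pilot accounts to staff who know about the migration."""
    if account_id in pilot_account_ids:
        return "migration-aware team"
    return "general queue"


source = {
    "A1": Account("A1", "Red", 2500),
    "A2": Account("A2", "Blue", 1800),
    "A3": Account("A3", "Red", 2500),
}
migrated = {
    "A1": Account("A1", "Red", 2500),
    "A2": Account("A2", "Blue", 2800),  # charge altered during migration
    "A4": Account("A4", "Blue", 1800),  # not present in any source system
}

errors = find_migration_errors(source, migrated)
for account_id, problem in errors:
    print(account_id, problem)
```

The point is not the code itself but that every expected class of error gets a check that can be re-run after each trial migration.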

You are NOT going to totally eliminate problems. There will always be the weird bug that no one foresaw just because a user took out a contract on Friday 13th at 6pm and the backspace key was hit three times, or something equally bizarre. But by having staff dogfood the system, having volunteer customers serve as guinea pigs and doing a trial migration first, you should be able to eliminate 95% of errors. You can also show the regulatory authorities that you've done every single thing that you can both to catch errors and to allow the project to be called to a halt should someone feel that's what is required. These two safeguards are vital, otherwise a project will just continue to lurch from hidden disaster to hidden disaster whilst the execs sit fat, dumb and happy because the status reports all say "All green!".

"Once the issue was finally escalated to senior management there was a prompt, full and thorough investigation and every effort was made to fix the underlying failure and to refund in full all affected customers as quickly as possible."

"Finally escalated to senior management". To me, this just reinforces the vision that senior management either weren't interested or were putting pressure onto the project to "just get it done". Often, this pressure isn't deliberate but subtle. Comments like "Every day the project is delayed costs us!" are used all too often to exert influence and, in turn, management don't get a truthful picture of what is going on.

"The IT failure involved was resolved by April 2015 – approximately 11 weeks after senior managers were finally alerted to it – with a system-wide change implemented in October 2015 that – as Ofcom acknowledges – means this error cannot be repeated in future."
"More broadly, we have conducted a full internal review of this failure and, as a result, have overhauled our management control and escalation procedures. A failure of this kind, while rare, should have been identified and flagged to senior management for urgent resolution much earlier."

I would love to see what this internal review uncovered. I did ask Vodafone about it but they refuse to share it as it's an internal review. I can have a guess though, and I suspect that it's a lot of what I've already mentioned here. I also suspect that most of the issues fall into the same list of project sins that Tony Collins put together in his book "Crash: 10 easy ways to avoid a computer disaster". Even though this book was first published in 1997, 19 years on the same mistakes get made time after time after time. It seems those that ignore history are truly condemned to repeat it.

Tuesday, July 12, 2016

The Phoenix Project - a book review - of sorts!

There is a book doing the rounds at the moment called "The Phoenix Project" which talks a lot about new ideas coming into the IT world around Kanban and DevOps.

I've recently read it and I'll admit that it's got some interesting ideas, but like any work delivery method these are ideas that must be embraced at a higher level than just in IT, and the book does comment on this as well. Unfortunately, the book presents a bit of an idealised vision of how everyone gets on board with the Kanban method of working and how it changes things for the better.

I have no disagreements with Kanban. It's a good method for keeping on top of things and I'm trying to make myself more disciplined in its usage by having my workload in Trello and using that to prioritise.

For those of you who don't know what Kanban is, the best way I can describe it is to think of a collection of five paper trays which are labelled In, Action, Filing, In progress and Pending.

In - This is new work that has arrived and needs to be prioritised.
Action - This is work in progress: it needs to be done but is in the queue, and whilst it's in the queue it's not generating any value.
Filing - This is work that is completed but just needs to be signed off/filed away.
In progress - This is work that is being done.
Pending - This is work that is waiting on someone else. Much like Action, it's in a queue, but in this case it's waiting for a third party.
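
If it helps to see the trays as a data structure, here is a rough Python sketch of the five trays above. It has nothing to do with Trello's API; the `Tray` and `Board` names and the example task are made up for illustration.

```python
from enum import Enum


class Tray(Enum):
    IN = "In"
    ACTION = "Action"
    IN_PROGRESS = "In progress"
    PENDING = "Pending"
    FILING = "Filing"


class Board:
    """A tiny in-memory board: one list of items per tray."""

    def __init__(self):
        self._trays = {tray: [] for tray in Tray}

    def add(self, item):
        # New work always arrives in the In tray.
        self._trays[Tray.IN].append(item)

    def move(self, item, destination):
        # Find the item wherever it currently sits and re-file it.
        for items in self._trays.values():
            if item in items:
                items.remove(item)
                self._trays[destination].append(item)
                return
        raise ValueError(f"{item!r} is not on the board")

    def items(self, tray):
        return list(self._trays[tray])


board = Board()
board.add("Patch the file server")
board.move("Patch the file server", Tray.ACTION)       # prioritised, now queued
board.move("Patch the file server", Tray.IN_PROGRESS)  # actually being worked on
print(board.items(Tray.IN_PROGRESS))
```

Each item lives in exactly one tray at a time, which is what makes it obvious at a glance what is queued, what is being done and what is stuck waiting on someone else.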

In Trello, I have columns (the kanban) that follow a similar pattern. This way, I know what I'm waiting on, what needs to be done and what is coming in. It's just my way of keeping myself organised.

The Phoenix Project does a nice job of introducing Kanban and the whole work in progress flow, although one of the characters is rather annoying: a deus ex machina who is there to kick the protagonist in the head and lead him down the right route.

And that's the big problem with this book. It does a great job of introducing Kanban and DevOps to the world of IT and cannot be faulted for that, but it shows everyone happily pitching in and helping out once they see it work. There is the problem: in the real world it doesn't take much for people to slip back into old habits and screw themselves over.

I've worked for quite a few companies in my years and I've done CMM, ITIL, and now it's all about Kanban and DevOps.

Now, don't get me wrong, I actually highly approve of ITIL, Kanban and DevOps, but once again, it needs to be pushed from the top with EVERYONE getting involved to make Kanban work. If you only have one or two people doing it then it's going to fall apart very quickly for the department. However, this is not a reason to not do this sort of thing on some sort of personal level, just to keep yourself organised.

One of the most important lists I have in Trello is called "known issues", where I make a note of any issue that I've spotted. This only needs to be a few words or a picture of the issue, but it's come in handy a few times in the past where something has needed to be done and existing issues have had to be dealt with first. Then again, I'm one of these annoying people who believes in taking notes, so I think I'm more predisposed to working in a type of Kanban style anyway.

Either way, I do recommend giving the book a passing glance. It's the first time I've seen a workflow method written down as a story so it's worth a look just for that and you never know, you might pick up a few tips.