Tuesday, March 21, 2017

Avoiding a GitLab-style outage with Blue/Green deployments

As most of you are probably aware, GitLab is, in part, a source code hosting service, and it suffered a major outage just a few weeks ago. Unusually, they posted a very full and frank report on what actually happened.

You can read more of this here.

Thursday, February 09, 2017

Fighting Azure AD Connect's custom installer

I’ve recently been spending more and more time looking into various cloud technologies such as AWS and Azure. One of the projects I’ve been working on required the on-premises Active Directory to be extended to Azure to allow for a future introduction of various Office 365 elements.
The process for doing this is fairly easy, as it’s just a matter of installing the Azure Active Directory Connect tool onto a server, creating the domain in the Azure portal and then waiting for Azure AD Connect to sync.

For most installs, the bundled SQL Express will do the job. For others, you'll want to use an existing SQL Server instance. However, this is not as easy as it should be...

Read the full article on StarWind's blog page.

Tuesday, December 13, 2016

Do you really need a Hyper-V cluster?

Something I've come across a few times over the last couple of weeks is people asking for a Hyper-V cluster because management are demanding HA and 100% uptime.

The problem is that a Hyper-V cluster will only provide HA for the physical hosts that make up the cluster and not for the VMs that are running on the nodes in the cluster. If a VM crashes, becomes corrupted or has some other fault then it's down. All the host-level clustering in the world won't help that.

I'd even go further and say that a Hyper-V cluster harms the requirement to have HA, as a cluster needs to have the VMs hosted on shared storage. This then puts the company in the strange position of having HA on the hosts but not on the storage. When I've raised this, I've been told "The SAN won't break, it's a SAN".

Sorry, SANs can and will break. They are just bits of hardware that run bits of software. Normally they're very reliable bits of hardware with very reliable bits of software, but it's still software with bugs in. I've even had the situation where management are too scared to upgrade the SAN because it contains data that is just too important to lose. By choosing this route, you really are choosing to put all your eggs in one basket.

Yes, you've got HA at the hypervisor level, but you have nothing for the VMs and nothing for the storage. You're still down if the SAN has a bad day or if the VM crashes. You're really not in the place you want to be in, as you don't have HA at the guest level. So, how do we have HA for the guests?

Like anything, there are costs involved, and the less downtime you want, the more you have to spend. Today, a lot of applications have HA in some form available inside the software itself. For example, SQL Server has Always On availability groups, Exchange has Database Availability Groups, AD is built to be fault tolerant anyway and for file storage you've got DFS. All you need is to split these workloads between two hypervisors running on local storage and you have guest-level HA. Take a host out and you're still working.

What about other workloads? In this case there is nothing wrong with a cluster. Build a two-node cluster out of VMs and use a witness share on a NAS somewhere so that you've got quorum, and you've then got HA for those applications as well. Again, as long as the workload is split across two or more physical hosts you're good to go. You've also got Hyper-V's ability to replicate a VM from one host to another which, as long as the bandwidth is there, gives you a nice DR option.
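The "split across two or more physical hosts" condition is easy to check mechanically. Here is a minimal Python sketch of that check. The group names, VM names and host names are all made up for illustration, and it assumes you already have a mapping of which VM lives on which host from whatever inventory you keep.

```python
def groups_on_one_host(ha_groups: dict[str, list[str]],
                       placement: dict[str, str]) -> list[str]:
    """Return the HA groups whose member VMs all run on the same host."""
    at_risk = []
    for group, vms in ha_groups.items():
        # Collect the distinct hosts this group's VMs are placed on.
        hosts = {placement[vm] for vm in vms}
        if len(hosts) < 2:
            at_risk.append(group)
    return sorted(at_risk)


# Hypothetical layout: the SQL pair is split, the file servers are not.
ha_groups = {"sql-ag": ["sql1", "sql2"], "dfs": ["fs1", "fs2"]}
placement = {"sql1": "host-a", "sql2": "host-b",
             "fs1": "host-a", "fs2": "host-a"}
print(groups_on_one_host(ha_groups, placement))  # prints ['dfs']
```

Any group this reports would go down with a single host, which is exactly the situation guest-level HA is meant to avoid.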

Hyper-V also supports shared nothing live migration. You don't even need a SAN to migrate a VM from one host to another. Add in Veeam for backups (following the 3-2-1 rule, of course) and you've got yourself a fairly decent setup.

In summary, as more applications gain some level of application-aware HA, and as the list of features in Hyper-V and tools like Veeam grows, I don't see a need to ever cluster Hyper-V servers. It's not as if clustering them adds the sort of features that we see with a VMware cluster. Maybe one day it will, but I cannot see it.

Monday, November 07, 2016

Reviewing Vodafone's IT failure

A few days ago, Vodafone responded to Ofcom's fine and attempted to explain why they failed:

The matters under investigation were a consequence of errors during a complex IT migration which involved moving more than 28.5 million customer accounts and almost one billion individual customer data fields from seven legacy billing and services platforms to one, state-of-the-art system. 

This paragraph is probably one of the most important in the whole document, as it explains what they were attempting to do. However, it does include one key phrase that always indicates that a project is doomed to failure. It states that the project is a complex one.

No IT project should be complex. If it is, then you've not broken the work into small enough chunks.

Clearly, this project had its challenges, but moving data from one system to another is NOT complex. The source data absolutely needs to be understood, and the fact that there are seven source systems clearly means that there will be challenges around merging this data into a central system. No doubt each of those seven systems will have a different database schema, and the same customer record could be held in several systems with slight variations in how that data is stored.
All these issues add challenges, but none of it is particularly complex. Clearly, DBAs who understand each schema need to be involved, and they need to be able to call a halt if they feel that something isn't right.
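To show how un-complex the "same customer, slightly different data" problem can be, here is a rough Python sketch. The field names (system, name, postcode), the sample records and the matching rule are all my own assumptions for illustration; a real merge would use whatever the DBAs say identifies a customer.

```python
from collections import defaultdict


def match_key(record: dict) -> tuple[str, str]:
    """Normalise the fields used to decide two records are the same customer."""
    name = " ".join(record["name"].lower().split())       # case and spacing
    postcode = record["postcode"].replace(" ", "").upper()
    return name, postcode


def group_duplicates(records: list[dict]) -> list[list[dict]]:
    """Return groups of records that normalise to the same key."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]


# Made-up records from three hypothetical legacy systems.
records = [
    {"system": "billing_a", "name": "Jane  Smith", "postcode": "ab1 2cd"},
    {"system": "billing_b", "name": "jane smith", "postcode": "AB12CD"},
    {"system": "billing_c", "name": "John Doe", "postcode": "EF3 4GH"},
]
for group in group_duplicates(records):
    print([r["system"] for r in group])  # prints ['billing_a', 'billing_b']
```

The output is a list for a human to review, not an automatic merge. That is deliberate, as it gives the people who know the data the chance to call a halt.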

Despite multiple controls in place to reduce the risk of errors, at various points a small proportion of individual customer accounts were incorrectly migrated, leading to mistakes in the customer billing data and price plan records stored on the new system. Those errors led to a range of different problems for the customers affected which – in turn – led to a sharp increase in the volume of customer complaints. 

Controls aren't going to help. This is the sort of project that needed to be done two or three times. By that, I mean that you do the data migration once as a trial and write scripts to pinpoint specific errors. These would be the errors that the DBAs and front end staff expect to see, based on their experience of customer issues and of the database structure.
Once that's done, and once you've confirmed that these tools work and find problems, you then migrate your own staff and a select number of customers who have been previously contacted and offered some level of compensation to act as guinea pigs for this work. Say, 5,000 accounts.
Each one of those accounts would have a flag so that if and when they call in with problems, they get put through to someone who is aware that they've been migrated.
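The "scripts to pinpoint specific errors" don't need to be clever. A minimal Python sketch of the idea is below. The account numbers, field names (price_plan, balance_pence) and values are invented, and I'm assuming both the old and new data can be exported into dictionaries keyed by account number.

```python
def find_migration_errors(source: dict[str, dict],
                          migrated: dict[str, dict],
                          fields: list[str]) -> list[tuple]:
    """Compare each source account with its migrated copy, field by field."""
    errors = []
    for account_id, old in source.items():
        new = migrated.get(account_id)
        if new is None:
            # The account never arrived on the new system.
            errors.append((account_id, "<account missing>", None, None))
            continue
        for field in fields:
            if old.get(field) != new.get(field):
                errors.append((account_id, field,
                               old.get(field), new.get(field)))
    return errors


# Made-up trial data: one wrong price plan and one missing account.
source = {
    "1001": {"price_plan": "plan-a", "balance_pence": 1250},
    "1002": {"price_plan": "plan-b", "balance_pence": 0},
    "1003": {"price_plan": "plan-a", "balance_pence": 300},
}
migrated = {
    "1001": {"price_plan": "plan-a", "balance_pence": 1250},
    "1002": {"price_plan": "plan-c", "balance_pence": 0},
}
for error in find_migration_errors(source, migrated,
                                   ["price_plan", "balance_pence"]):
    print(error)
```

Run against a trial migration, a report like this is something the DBAs can use to justify calling a halt, which a status report that says "All green!" never will.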

You are NOT going to totally eliminate problems. There will always be the weird bug that no one foresaw, just because a user took out a contract on Friday 13th at 6pm and the backspace key was hit three times, or something equally bizarre. But by having staff dogfood the new system, having volunteer customers serve as guinea pigs and doing a trial migration first, you should be able to eliminate 95% of errors. You can also show the regulatory authorities that you've done every single thing that you can, both to allow the project to be called to a halt should someone feel that's what is required and to catch errors. These two safeguards are vital, otherwise a project will just continue to lurch from hidden disaster to hidden disaster whilst the execs sit fat, dumb and happy because the status reports all say "All green!".

Once the issue was finally escalated to senior management there was a prompt, full and thorough investigation and every effort was made to fix the underlying failure and to refund in full all affected customers as quickly as possible.

"Finally escalated to senior management". To me, this just reinforces the impression that senior management either weren't interested or were putting pressure onto the project to "just get it done". Often, this pressure isn't deliberate but subtle. Comments like "Every day the project is delayed costs us!" are used all too often to exert influence and, in turn, management don't get a truthful picture of what is going on.

The IT failure involved was resolved by April 2015 – approximately 11 weeks after senior managers were finally alerted to it – with a system-wide change implemented in October 2015 that – as Ofcom acknowledges – means this error cannot be repeated in future.
More broadly, we have conducted a full internal review of this failure and, as a result, have overhauled our management control and escalation procedures. A failure of this kind, while rare, should have been identified and flagged to senior management for urgent resolution much earlier.

I would love to see what this internal review uncovered. I did ask Vodafone about it but they refuse to share it as it's an internal review. I can have a guess though, and I suspect that it's a lot of what I've already mentioned here. I also suspect that most of the issues fall into the same list of project sins that Tony Collins put together in his book "Crash: 10 easy ways to avoid a computer disaster". Even though this book was first published in 1997, 19 years on the same mistakes get made time after time after time. It seems those that ignore history are truly condemned to repeat it.

Tuesday, July 12, 2016

The Phoenix Project - a book review - of sorts!

There is a book doing the rounds at the moment called "The Phoenix Project" which talks a lot about new ideas coming into the IT world around Kanban and DevOps.

I've recently read it and I'll admit that it's got some interesting ideas, but like any work delivery method, these are ideas that must be embraced at a higher level than just in IT. The book does comment on this as well. Unfortunately, the book is a bit of an idealised vision of how everyone gets on board with the Kanban method of working and how it changes things for the better.

I have no disagreements with Kanban. It's a good method to keep on top of things and I'm trying to make myself more disciplined in its usage by having my workload in Trello and using that to prioritise.

For those of you who don't know what Kanban is, the best way I can describe it is to think of a collection of five paper trays which are labelled In, Action, Filing, In progress and Pending.

In - This is new work that has arrived and needs to be prioritised.
Action - This is work in progress: it needs to be done but is in the queue. Whilst in the queue it's not generating any value.
Filing - This is work that is completed but just needs to be signed off/filed away.
In progress - This is work that is being done.
Pending - This is work that is waiting on someone else. Much like Action, it's in a queue, but in this case it's waiting for a third party.

In Trello, I have columns (the kanban) that follow a similar pattern. This way, I know what I'm waiting on, what needs to be done and what is coming in. It's just my way of keeping myself organised.
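If it helps to see the trays as data, here is a tiny Python sketch of the same idea. The card titles are made up, and treating Action and Pending as the "queued" trays is just my reading of the descriptions above. It has nothing to do with how Trello itself stores anything.

```python
TRAYS = ("In", "Action", "In progress", "Pending", "Filing")
QUEUED = {"Action", "Pending"}  # sitting in a queue, so not generating value


def tray_counts(cards: dict[str, str]) -> dict[str, int]:
    """Count how many cards sit in each tray."""
    counts = {tray: 0 for tray in TRAYS}
    for title, tray in cards.items():
        if tray not in counts:
            raise ValueError(f"unknown tray {tray!r} for card {title!r}")
        counts[tray] += 1
    return counts


def queued_cards(cards: dict[str, str]) -> list[str]:
    """List the cards that are waiting rather than being worked on."""
    return sorted(title for title, tray in cards.items() if tray in QUEUED)


# Made-up cards, keyed by title with the tray they currently sit in.
cards = {
    "Patch file server": "Action",
    "Renew certificate": "In progress",
    "Firewall change": "Pending",
    "New starter laptop": "In",
}
print(tray_counts(cards))
print(queued_cards(cards))  # prints ['Firewall change', 'Patch file server']
```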

The Phoenix Project does a nice job of introducing Kanban and the whole work in progress flow, although one of the characters is rather annoying as a deus ex machina who is there to kick the protagonist in the head and lead him down the right route.

And that's the big problem with this book. It does a great job of introducing Kanban and DevOps to the world of IT and it cannot be faulted for that, but it shows everyone happily pitching in and helping out when they see it work. In the real world, it doesn't take much for people to slip back into old habits and screw themselves over.

I've worked for quite a few companies in my years and I've done CMM and ITIL, and now it's all about Kanban and DevOps.

Now, don't get me wrong, I actually highly approve of ITIL, Kanban and DevOps but, once again, it needs to be pushed from the top with EVERYONE getting involved to make Kanban work. If you only have one or two people doing it then it's going to fall apart very quickly for the department. However, this is not a reason to not do this sort of thing on some sort of personal level, just to keep yourself organised.

One of the most important lists I have in Trello is called "known issues", where I make a note of any issue that I've spotted. This only needs to be a few words or a picture of the issue, but it's come in handy a few times in the past where something has needed to be done and existing issues have had to be dealt with first. But then I'm one of those annoying people who believes in taking notes, so I think I'm predisposed to working in a Kanban style anyway.

Either way, I do recommend giving the book a passing glance. It's the first time I've seen a workflow method written down as a story so it's worth a look just for that and you never know, you might pick up a few tips.

Tuesday, July 05, 2016

NTP in a virtualised world

Let me start this off by saying that I love NTP. The whole way the protocol has been designed is truly elegant, and it is such an important protocol, and so often neglected, that I thought I'd put together a blog article on how I configure NTP and why.

Before that, it's important to go through a few things about how NTP works. If you're familiar with NTP, feel free to skip to the next bit.

Basics of NTP

The first thing to note is that NTP relays time in UTC.

If you think about it, NTP has to be ignorant of timezones. Its whole job is to keep accurate time, and timezones would just upset that as there are so many of them. Better to just keep to something like UTC internally and have the OS deal with the timezone.

One question this always generates is "What happens if I point my UK server at a US time zone source?" - Because NTP doesn't care about timezones, those NTP servers in the US will have the exact same time as those based in the UK and across the rest of the world. It's up to the operating system to sort out the time zone so yes, pointing NTP at servers in another country is fine and it's not going to force all your machines into the time zone of the country where the NTP server is! 

Another thing to realise is that NTP is hierarchical. Each time server in the chain is said to be at a particular stratum level. Stratum 0 is a reference clock such as an atomic clock. Stratum 1 and 2 would be NTP servers around the world that you can connect to. Your internal time source (if you use one) would be stratum 3 or, more likely, stratum 4. There is a very good explanation of it all here: https://ntpserver.wordpress.com/2008/09/10/ntp-server-stratum-levels-explained/

It's also key to note that NTP takes into account the latency involved in contacting an NTP server. This means that even connecting to NTP servers around the world you should find that your time is still within 500ms of reference time and even accurate down to less than 100ms.
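The latency correction comes down to four timestamps per exchange and a bit of arithmetic. The Python below is only a sketch of that arithmetic, not an NTP client. The class and function names are my own, and the sample numbers are made up so that the answer is easy to check by hand.

```python
from dataclasses import dataclass


@dataclass
class NtpExchange:
    """The four timestamps of one client/server exchange, in seconds."""
    t1: float  # client clock when the request left the client
    t2: float  # server clock when the request arrived
    t3: float  # server clock when the reply left the server
    t4: float  # client clock when the reply arrived


def round_trip_delay(x: NtpExchange) -> float:
    """Time spent on the network, excluding the server's processing time."""
    return (x.t4 - x.t1) - (x.t3 - x.t2)


def clock_offset(x: NtpExchange) -> float:
    """How far the client is behind (+) or ahead (-) of the server,
    assuming the outbound and return trips take equally long."""
    return ((x.t2 - x.t1) + (x.t3 - x.t4)) / 2


# Made-up exchange: the client is 0.250 s behind, each network leg takes
# 0.040 s and the server spends 0.010 s answering.
sample = NtpExchange(t1=100.000, t2=100.290, t3=100.300, t4=100.090)
print(round(round_trip_delay(sample), 3))  # prints 0.08
print(round(clock_offset(sample), 3))      # prints 0.25
```

The offset comes out right however long the round trip is, as long as the two legs are roughly symmetric. That is why a server on the other side of the world can still give you good time.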

It's also always a good idea to provide multiple external NTP servers. In testing, I've found that three is the optimum number, as three allows for one to be ruled invalid by NTP cross-checking the three servers. It also allows NTP to use some clever maths to offset the latency of all three and to average out the time received from all three, to ensure that the time you're getting is as close to reference time as it can be (see, I said that NTP was elegant!). In testing, it was often possible to get time on the server to within 10ms around 90% of the time and within 100ms 100% of the time.
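To give a feel for why three servers let one be ruled out, here is a deliberately simplified Python sketch. It is NOT the selection algorithm that a real NTP daemon uses, which is considerably more involved. The median-plus-tolerance rule, the 100ms default and the server names are all assumptions of mine for illustration.

```python
from statistics import mean, median


def combine_offsets(offsets: dict[str, float],
                    tolerance: float = 0.1) -> tuple[float, list[str]]:
    """Average the offsets that agree with the median; report the rest.

    offsets maps a server name to its measured clock offset in seconds.
    """
    middle = median(offsets.values())
    agreeing = {name: offset for name, offset in offsets.items()
                if abs(offset - middle) <= tolerance}
    rejected = sorted(set(offsets) - set(agreeing))
    return mean(agreeing.values()), rejected


# Made-up measurements: two servers agree, one is wildly out.
measured = {"ntp-a": 0.012, "ntp-b": 0.018, "ntp-c": 2.500}
estimate, rejected = combine_offsets(measured)
print(round(estimate, 3), rejected)  # prints 0.015 ['ntp-c']
```

With only two servers that disagree there is no way to tell which one is wrong. The third gives you a majority.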

Now that I've gone through some of the aspects of how NTP works, another key question is "Does this apply in a virtualised world?" The answer is yes, but with a few caveats to watch out for.

NTP issues to watch out for:

1. Circular time referencing.

The first issue to watch for is circular time referencing. This occurs when a source of time is set up as a virtual machine and is pulling its time from the host, which in turn pulls its time from the VM. At this point the whole time hierarchy breaks, as the hypervisor is both a recipient and a giver of time. This needs to be avoided as it'll cause no end of issues: there is no way to correct for clock drift, as neither server is authoritative in terms of the NTP hierarchy.

This is why it is vitally important to turn OFF the ability to pull time from the host on ANY server that participates in any sort of NTP hierarchy. This sort of issue is most often seen when the VM is running as the Active Directory PDC emulator. In that case it's always best that BOTH the hypervisor and the DCs pull time from an external source such as the pool NTP servers.
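A circular reference is easy to spot once you write down who gets time from whom. The Python sketch below does just that. The host names (dc01, hyperv01) are hypothetical, and it assumes you've collected the "gets its time from" mapping yourself, as nothing here queries a real machine.

```python
def find_time_loop(sources: dict[str, str], start: str) -> list[str]:
    """Follow 'gets its time from' links from start.

    Returns the loop as a list of machines if one exists, otherwise [].
    A machine missing from sources is treated as an external source.
    """
    path: list[str] = []
    node = start
    while node in sources:
        if node in path:
            # We've come back to a machine we already visited.
            return path[path.index(node):] + [node]
        path.append(node)
        node = sources[node]
    return []


# Broken: the DC (a VM) syncs from its host, and the host syncs from the DC.
broken = {"dc01": "hyperv01", "hyperv01": "dc01"}
print(find_time_loop(broken, "dc01"))   # prints ['dc01', 'hyperv01', 'dc01']

# Healthy: both pull from an external source.
healthy = {"dc01": "pool.ntp.org", "hyperv01": "pool.ntp.org"}
print(find_time_loop(healthy, "dc01"))  # prints []
```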

2. Invalid time in ESXi

The screenshot above is a very important one. It shows that NTP is configured and running but that, for some reason, the time on the host is WRONG, which is why it's highlighted in red. This is often because the host cannot connect to the NTP server, which is commonly seen when external NTP is used and it's blocked on the firewall.

Why does accurate time matter on the host if VMware Tools/integration tools are turned off?

Even if such tools are turned off, the VM has to ask the hypervisor for time under two special conditions. The first is when the VM is powered on. Because a VM doesn't have a CMOS, it has absolutely no idea what the time is when it's first powered on. The only place it can get it from is the host. If the time on the host is wrong then you've already got a problem and the VM isn't even started yet! This is key when you're dealing with Active Directory which, by default, needs time to be within 5 minutes of UTC across the domain (AD also ignores timezones internally). If it isn't, then you're going to have authentication issues.

The second case is during vMotion/Live Migration tasks. For a fraction of a second, that VM does not exist on the old host or the new host. When the host you're transferring it to takes on responsibility for the VM, the VM has to ask the host for the time. Again, if it's wrong then you have issues.
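The five-minute tolerance mentioned above is simple to express as a check. This Python sketch only does the comparison. How you obtain the host's clock and a trusted reference time is left out, and the function name is my own.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=5)  # the default tolerance described above


def within_tolerance(host_utc: datetime, reference_utc: datetime,
                     max_skew: timedelta = MAX_SKEW) -> bool:
    """True if the host clock is close enough to the reference clock."""
    return abs(host_utc - reference_utc) <= max_skew


# Made-up example: a host running four minutes fast, then six minutes fast.
reference = datetime(2016, 7, 5, 12, 0, tzinfo=timezone.utc)
print(within_tolerance(reference + timedelta(minutes=4), reference))  # prints True
print(within_tolerance(reference + timedelta(minutes=6), reference))  # prints False
```

Both times are compared in UTC, which matches the point made earlier: timezones don't come into it.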

Those are the two major misconfigurations/issues I've seen the most in virtualised environments with the circular time setup being the most common. Getting NTP right even in a small network is key to avoiding strange authentication issues and other problems.

Monday, June 13, 2016

Have you used Chocolatey?

If you've not come across Chocolatey then it's certainly worth a look. Those of you who have used Linux will be familiar with yum and/or apt-get. Well, this is the Windows version of that software.

As a Windows admin for some years, I've used tools such as nLite and Ninite to create custom builds and automated installs. I've also used Windows GPOs to install software or to make software available in the Add/Remove Programs list, but nothing quite compares to the ease with which Chocolatey allows software to be installed.

The way it all works will be very familiar to Linux admins. Chocolatey uses a repository where all the install files live and then a very simple command will allow for the necessary software to be installed as long as you have an internet connection to the repository.

It's also possible to set up an internal repository so that you can install both your own software and third party software from a trusted internal resource, as there is always a risk that someone has uploaded a malicious installer to Chocolatey.

Details about the process for using Chocolatey are here and I do encourage you to give it a go. If you use automated/unattended installations then using Chocolatey to install applications not only makes sure that you've got the latest versions but that you also have a relatively simple upgrade method.
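If you want to drive Chocolatey from a build script, a thin wrapper is all it takes. The Python sketch below just builds and runs the choco command line. It assumes choco is already installed and on the PATH, and the internal feed URL in the example is entirely made up.

```python
import subprocess
from typing import Optional


def choco_command(action: str, package: str,
                  source: Optional[str] = None) -> list[str]:
    """Build a choco command line for installing or upgrading a package."""
    if action not in ("install", "upgrade"):
        raise ValueError(f"unsupported action: {action!r}")
    command = ["choco", action, package, "-y"]  # -y answers the prompts
    if source:
        # Point choco at a specific repository, e.g. an internal one.
        command += ["--source", source]
    return command


def run_choco(action: str, package: str,
              source: Optional[str] = None) -> None:
    """Run the command, raising CalledProcessError if choco fails."""
    subprocess.run(choco_command(action, package, source), check=True)


# Building the commands is safe anywhere; run_choco needs a Windows
# machine with Chocolatey installed.
print(choco_command("install", "git"))
print(choco_command("upgrade", "all"))
print(choco_command("install", "git",
                    source="https://choco.example.internal/feed"))
```

Keeping the command building separate from the running means the unattended install script can log exactly what it is about to do before it does it.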

Some have asked why I'm so interested in this sort of technology and it's simply because I've had a bit of a revelation of late. That revelation is around automation and DevOps.
I suspect that most IT folk have unattended scripts for installing Windows, and I also suspect that many have a few scripts floating around that they reinvent when they can't find the original.
DevOps is changing all of that. There are hundreds of tools out there to simplify all of this, and it's my firm belief that tools like Chocolatey are part of a huge cultural change coming to IT. Check it out, it's going to be the future.