Friday, July 03, 2015

How to do DNS correctly

Time and again people seem to be doing DNS outside of best practice rules so I thought it might be a good time to go through how DNS works, what DNS best practice is (with regards to a windows environment) and why it's like that.

In a nutshell, the most common mistake I see with DNS configuration is this:



This configuration is put in place for one of two reasons:

1. It is there to resolve external addresses should the internal fail.
2. It's there to provide internal access should the internal fail.

Point 1 is the most common that I come across and it's very wrong because that is not how DNS works.
When a name query is run, DNS will ask the first name server to handle it, if that name server replies and says "I don't know what that name is that you've sent me", that's it. DNS will not ask the second DNS server because it has a valid reply. Yes, negative replies are valid replies. They are even cached locally for a period of time. All of this is covered in RFC 2308.

Point 2 appears to make some sense, if the internal DNS server dies then queries to it fail but, hey, at least people can still get on the internet - right?
Well yes but.... Every now and then that first DNS server is going to be too busy to reply so the client will ask the second DNS server. If the query is for an internal resource then the second DNS server won't know about it and suddenly you've got this weird condition where a client appears to be refusing to ask the internal DNS server and nothing internally is being resolved, again this is due to the cached negative responses covered in point 1.

Best practice is always to have your clients use internal DNS servers and it's always best practice to have two internal DNS servers.

The second big configuration error that I see is people using internal name servers in the forwarders. This is utterly pointless as the forwarder is there to handle queries that your internal DNS servers cannot. So, internally if you ping www.google.com your internal DNS servers won't know what that is so will pass it on to the forwarder.
If your forwarders are just internal servers then the query will either take a long time to complete (i.e. until it gets out of the network) or it'll just fail.

In summary, Internal DNS server IP addresses for clients, forwarders on the DNS servers for everything else. Stick to that and DNS shouldn't ever be a problem.

Tuesday, March 31, 2015

My thoughts on handling a system outage

If its going to happen then it'll happen at the worst possible time. It'll happen that friday evening just before going home/beer o'clock and it'll be a long weekend plus whatever system dies a death will be the very system that you have logged on to exactly once, several months ago and that was in error.
In short, its going to be the one system that you know nothing about and the one that normally just works.

And then the phones will ring with people demanding action because it just so happens that the boss type wants something from that system before he leaves for the day and no it cannot wait because it has to be right now.

 So what do you do in these situations?

 Believe it or not, the answer is "nothing" - at least at first.

 No matter the issue, no matter how many people are telling you to get it fixed now the very worst thing you can do is try things out 'to see if it works'.
 You might get lucky but you probably won't and by trying things out at random you will turn what is probably a simple thing into an epic hunt to track down what it was you changed just to get the system back to how it was before you 'just tried something'.

 Any system that breaks needs to be treated like a crime scene, there is evidence there of what caused it to break. This evidence needs to be collected, something as simple as a reboot may well fix the problem but it may not and in rebooting you could lose that evidence and it may well be the very clue that is needed to stop the problem from happening again, possibly on multiple systems. So what to do?

At this point its very easy to bow to pressure and try something, anything to get it working and get out of there for the weekend but this potentially puts you into the above category where you'll be fighting to get it back to a known, broken state!

Firstly, preserve the evidence. If the box has blue screened, take a screenshot via drac or ilo or on your phone and only then reboot it.

Once it has rebooted grab a copy of the dump file, there are some excellent online tools that will analyse the dump files for you.

 If the box hasn't blue-screened then try and grab a copy of the state of the machine - what services are running?
What applications are running?
What is its ip configuration?
How busy is it?

Secondly, preserve the logs.
Take copies of the system and application event logs.
If the application has its own logs then copy those.
Ideally, all logs should already be sent to a syslog server, of course this is fine for linux but what about windows? Again, there are agents for windows that will perform this task admirably.

So, now you've got some basic evidence, what next?
This all depends on the system but how people access it, for example, is it ok from inside the company but broken from outside? If so, the server is fine but you may have a connectivity, firewall or load balancer issue.
If its broken from inside and out then its probably the server.
No matter the issue, basic connectivity tests are a good place to start.
Can the server contact its default gateway?
Can the server contact a server in another vlan?
Can the server contact the internet?
Googles dns servers at 8.8.8.8 and 8.8.4.4 are wonderful for connectivity tests!
Can the server resolve names? Nslookup is the best tool here.

I'll expand more on this in a later article but suffice to say, the 7 layer osi model can be a handy reference for troubleshooting. Working 'down' the model from application to physical is a good, methodical way to troubleshoot.

 In summary, both logs and the state of the machine represent the digital fingerprint of the issue. Its important to preserve them. It shouldn't take more than a few minutes to gather it, you need to make sure that you keep it together.

One other important thing to note is that once you've found the problem and if it is a bluescreen or crash due to a bug, driver, patch,etc its important to check out other systems that could be vulnerable to the same issue as this potentially will save you or a colleague from another nightmare friday night troubleshooting scenario.

Wednesday, January 28, 2015

David Cameron wants to ban encrypted messages

A couple of weeks back, David Cameron gave the following quote:
"If I am prime minister I will make sure that it is a comprehensive piece of legislation that does not allow terrorists safe space to communicate with each other"
And on the surface of things, that seems fair. After all, right now, the Police and Intelligence services can issue a warrant and get access to postal mail and phone calls and who wants terrorists to be able to plot in secret?

but......

Unfortunately, David Cameron's quote goes further and specifically targets internet encryption:
"In our country, do we want to allow a means of communication between people which even in extremis, with a signed warrant from the home secretary personally, that we cannot read? “Up until now, governments have said: ‘No, we must not'."
This is a very frightening statement because it shows a lack of understanding of how the internet works.

For example, right now, as I type this I am on a secure connection, an HTTPS/SSL connection to blogger. Everything I type here is being encrypted and sent over the wire to where the blogger server is. This connection is encrypted thanks to mathematics. There is no trusted third party. There is no Royal Mail or British Telecom that holds a master key to access everything.

Now, lets say that David Cameron gets his way and, within the UK, there is a law that says all encrypted messages should be readable by the security services. Just how will this work when some services are outside of the UK? OK, America will play ball but what about if I put a server in Asia? This sort of thing is very easy to do with the abundance of cloud computing resources.

Also, what actually constitutes a "message"? If I connect up a secure, encrypted VPN to my work place and send out an email is that a message? Would that break Camerons' snoopers charter law?

What if my bank sends me a notification about a new product for my account when I logon to the their secure site? Is that a message?

To provide the security services with a "skeleton key of encryption" would require putting a genie back in the bottle that was released when Phil Zimmerman published PGP. To undo that now is impossible and, in many ways pointless. 

Why pointless?

Well, if encrypted communications are banned/have a master key why wouldn't terrorists send images and use Stenography to hide it? Why wouldn't they spin up a cloud based server and leave messages directly on that server to each other? No encryption needed and if a pre-arranged code is used it would look just like random text, much like the Bayesian rule avoiding text you so often see at the bottom of spam email. Such a message could even be embedded IN a spam email.

And if there is a master key to all encryption, what would happen if it got out? And, of course, one day it would even if no one released it. It would become the single biggest hacker focus simply because of the bounty of information they'd get access to with ONE encryption key.
Suddenly every single secure communication in the UK (except Governments probably) can be read by someone with access to that key. All that data, just there for the taking.

This suggestion from David Cameron is beyond ludicrous, shows a serious misunderstanding of how encryption works and misses out on the fact that the internet is a global phenomenon.

I suspect it's just a feel good sound bite but it should show that the Conservatives have no grasp on e-commence and nor do they have any experts that they have spoken to before hand.
This either makes them totally inept or very dangerous. Only time will tell.

Thursday, May 30, 2013

VMWare VCenter 5.1 SSO - The worst installer out there?

I've spent the past few days attempting to install VMWare's VCenter 5.1 but the installer has other ideas, I first tried the install from build 947939 and each time I am hitting the same error:


This was after trying the simple installer which is supposed to do everything for you and instead just fails with bits of the install done and a nice big error in the install log for SSO:

    Caused by: java.sql.SQLException: Cannot open database "RSA" requested by the login. The login failed.
    Caused by: java.sql.SQLException: Login failed for user 'RSA_USER'.



Build 1123966 which was released just a few days ago fixes a few things. Finally, the installer understands what a dynamic SQL port is which is huge progress but again, Error 29114 rears it's ugly head.

Further digging into the vminst log shows this:

VMware Single Sign On-build-1123962: 05/30/13 00:44:45 ShellExecuteWait::done (wait) Res: 14
VMware Single Sign On-build-1123962: 05/30/13 00:44:45 RunSSOCommand:: error code returned is 14 while launching C:\Program Files\VMware\Infrastructure\SSOServer\utils\rsautil.cmd
VMware Single Sign On-build-1123962: 05/30/13 00:44:45 Posting error message 29114

According to VMWare the fix is listed in this article and that fix is as follows:

Cause
This issue is occurs when the SSO database RSA (default SSO DB name) does not meet the prerequisites for the SSO installation and the RSA DB table space RSA_INDEX has a filegroup type of PRIMARY, instead of RSA_INDEX.

Resolution
To resolve this issue, change the filegroup for the RSA_INDEX table space to RSA_INDEX using SQL Management Studio and then install vCenter SSO.

Contact your Database Administrator before making changes to the database.
Small problem though:


Yep, the filegroups on my SQL server for the RSA database are already set correctly.

There are a few suggestions out on the web that also suggest changing part of the SQL script to create the index in an MDF rather than an NDF but this never worked for me.

At the time of writing I'm still bashing my head against this issue and getting more annoyed at the terrible installer which not only is broken but compounds the breakage but putting information into several different log files.
 
VMWare SSO has to be the most fiddly, broken installer that is required for an application that I've ever come across. It seems obvious that VMWare understand that SSO has it's share of issues as 5.1U1a came hot on the heels of 5.1U1 but the installer still leaves a lot to be desired, still breaks in the majority of situations and still is a nightmare.
 

Wednesday, January 09, 2013

20 little rules for surviving in IT Projects



Below are just a few things I've come across in almost 19 years of working in IT. After a chat with a good friend who is considering breaking into It I thought it would be worth jotting them down.

1. Take time to document and keep a copy of the document. Make sure you document issues you had, that part will be invaluable in the future. When you document something make a note of software versions. You'll often be surprised at differences in 'point' versions. These differences can be the key between something working and something going majorly wrong.

2. Never be the person who says 'yeah, I remember seeing that error once' Document it. You'll need it some point. A well structured wiki or notes system can cut  future task time in half simply by recording known issues and fixes.

3. Never lie to clients be they internal or external. You don't have to tell them everything but a direct answer to a direct question, admitting when you are wrong and/or don't know something will win you points for honesty, I also call it being professional.

4. Going the extra mile is good, often that extra mile isn't staying until silly o'clock but providing some hand holding during a major piece of work or just some notes on why you are changing the config of something. Communication is vital.

5. Saying "its urgent" is often code speak for "I didn't plan and am giving you this problem". If you already have documentation on how to fix something flagged as urgent and you have the time then go for it, this can be worth major brownie points.

6. Sometimes something really is urgent. A major trick is knowing the difference between 'its urgent' because it really is and 'its urgent' because they didn't plan and need it now. This comes with experience and listening to people.

7. Weekend work, whilst unpleasant, can be made easier with planning and pre-staging. All weekend work should have a back out plan/rollback plan. Never be afraid to call for a rollback. You may not be popular with project mangers but that's preferable to digging yourself into a hole you can't get out of.

8. If you lack documentation for something (and at some point you will) then make your own. It will have gaps but you'll have a starting point to build on. See point 1

9. No matter who takes the minutes in a meeting make your own. they don't have to be anything official but a few notes often go a long way when, 6 months down the line, when someone asks about a point from that meeting/that client you'll have notes.
Something as simple as a notebook that never leaves your side and has name, date, attendees, client and actions as well as notes, diagrams, etc will be a major asset when you need it.

10. Prioritise. Most managers cant do it for you because they don't have the tech skills to know how something works, if they demand something be done asap, they are the boss so do as you are told.

11. Organise your email. As an IT person email is often the key way people communicate and where all the alerts go. Organise it with rules and alerts to keep on top of what is important and what is noise. Don't be afraid to consign those management prep talk emails to noise, its better to find a system problem than learn the latest internal buzz.

12. It's  rare to find someone in IT who will share knowledge. If they are happy to share knowledge make the time to learn as its always good to be able to take a step back and see the big picture of something. Also, if you can talk to other people, not just it people, using their own terms not only will they will be more willing to listen to you but they are more likely to help you. Be ready to share your own knowledge. It people who hoard knowledge are often scared of being caught out. If you are caught out you've just learnt a valuable lesson, accept it with good grace.

13. Listen. Be it to customers (internal and external), other people, meetings or to a point, gossip. Gossip can and will tell you more about what is going on in a company than regular status updates and from that, you can infer the likely impact to systems under your control. Gossip is sometimes the best measure of where to put your energies than anything else!

14. When you do a project for a non-IT person, other company, users, etc they are trusting you with one of the most precious things they have - their data.
Don't let them down. Listen to them, understand what it is they do with their data and explain to them, in plain English - with diagrams if required - just what it is you will do with it and why it'll be better. Over time this will gain you both trust and a reputation for being professional.

15. Any backup is only as good as the restore procedure. Any restore procedure that has not been tested in the last 12 months should be considered as just words in a document that has no basis in reality. This goes double if software/hardware has been upgraded.
Even if it's up to date, validate it. Its best to verify and understand the gaps before getting to the point where you need the restore and finding the gaps then or that a key step is missing.

16. Project plans are often just big task lists with no meaning to the dates. Understand the real deadlines in a project and the reason for those deadlines. When project managers start pushing you'll know the real score - this is better when you understand the customer and their job - see point 14.

17. At the end of a project do your own lessons learnt. Too often lessons are not learnt from lessons learnt. Your own version should highlight how to deal with certain blocks you encountered in a project especially if they were blocks put up by colleagues, it'll help with future project work.

18. Avoid the following names: old, new, temp. They will bite you and if a project is halted mid work for some reason when you go back to it which areas are valid? the old ones? New ones? Temp ones? Meaningful names will save headaches in future.
If you are saddled with someone else's poor naming convention then document it - see point 2!

19. If possible, test changes to live data before doing the change on live data. If its not possible make and test a backup. Saying "it'll be fine" is the IT version of "look at me!" Often with the bone (data) shattering crunch at the end.

20. When you asked to estimate time for a piece of work don't estimate time for just that piece of work. Add in things like documentation, internal processes, liaison with others, interruptions and then double it.

Friday, January 04, 2013

Test labs and internet explorer patches

Test Labs

Test labs seem to be somewhat in focus at the moment with my workplace deciding to dedicate a rack to a test environment that'll hold a Netapp cluster, cisco switch, brocade switch and a couple of servers running an ESX cluster.

With ESX added to the mix this is sufficient to allow for rapid deployment of servers which will allow testing for a huge variety of installations and more importantly allow for re-validation of existing solutions where those solutions have upgrades.

A classic example of existing solution problems is something I'll touch on later in the year but, in brief, it seems that the latest version (6.0.4) of Netapps 'Single Mailbox Recovery' product doesn't work with certain versions of Snap Manager for Exchange. A downgrade to 6.0.3 is required and it is because of very silly incompatibilities/version issues like this it's vital that every now and then existing solutions are revalidated with the latest OS versions, patches, software and various combinations thereof!

Personally, I am a huge fan of the HP Microserver and with the £100 cashback offer it's a good choice for a simple personal test lab. My own testlab consists  of a couple of microservers with some additional network cards and running VMWares ESX 5.0, a FreeNAS box running FreeNAS 8.3-P1 with 5 x 2Tb hard drives and one running Windows 2008 with Starwind as Starwind is free and provides a nice iSCSI target to play with.

In my opinion, having a test lab whether at work or at home is vital to do procedural and system testing  and projects should always have the time built in to allow for testing where software needs to upgraded or where completely new solutions have been suggested as it's always nice to know about and document the know issues before running into them on a production environment.

Personally, I like having my own test lab at home as many times I've seen work testlabs that are a total mess with a mixture of production and test systems as well as stuff that no one knows about but are too scared to delete - just in case. All too often, I've found that a home test lab can turn something around faster than at work which makes a work from home day quite a useful experience and, of course, it can be used for upgrading skills or as a refresher prior to an interview.

Internet Explorer Patches

The first patch Tuesday of 2013 will be here in a few days and this one is going to be fun with no less than 12 security holes being patched except for a zero day security hole found on Christmas day and currently being exploited. All versions except version 9 of IE are affected so it seems that the general consensus is that an upgrade to IE9 is a wiser move.

Of course, at this point a lot of people will make the same cry 'Don't use Internet Explorer' which is fine until you consider that A. some software REQUIRES it and B. Browsers like FireFox have addons that use Internet Explorer components to render patches that require Internet Explorer thus making the computer vulnerable anyway.


Monday, December 03, 2012

The four constraints of a Project

Today I came across an interesting blog article on Wayne Hales blog. It's about a comment made by Admiral Gehman (Head of the CAIB) regarding Space Shuttle Management at NASA. I've copied the key part below:

The program manager essentially has four areas to trade. The first one is money. Obviously, he can go get more money if he falls behind schedule. If he runs into technical difficulties or something goes wrong, he can go ask for more money. The second one is quantity.  The third one is performance margin. If you are in trouble with your program, and it isn’t working, you shave the performance. You shave the safety margin. You shave the margins.  The fourth one is time. If you are out of money, and you’re running into technical problems, or you need more time to solve a margin problem, you spread the program out, take more time. These are the four things that a program manager has. If you are a program manager for the shuttle, the option of quantity is eliminated. There are only four shuttles. You’re not going to buy any more. What you got is what you got. If money is being held constant, which it is—they’re on a fixed budget, and I’ll get into that later—then if you run into some kind of problem with your program, you can only trade time and margin. If somebody is making you stick to a rigid time schedule, then you’ve only got one thing left, and that’s margin. By margin, I mean either redundancy—making something 1.5 times stronger than it needs to be instead of 1.7 times stronger than it needs to be—or testing it twice instead of five times. That’s what I mean by margin.

What has this got to do with IT projects?

Well, a lot actually. In IT projects we are always up against the same sort of constraints and in IT projects there are the same four areas that can be traded:

  • Time
    • The timeframe for the project, including testing and documentation.
  • Money
    • The budget for the project including overtime.
  • Quality
    • The actual 'fit for purpose' result of the project.
  • Scope
    • What needs to be achieved within the time frame and with the money provided.


All four areas are linked and a change in one has an impact on the other three, for example, Reducing the time but keeping the scope the same will impact the money and quality side of things whereas increasing the quality either increases the time (due to the additional validation tests) or reduces the scope (to ensure that what is delivered is as accurate as possible).

Now, on most IT projects the two things that are set in stone is the budget and the time frame.
All too often the time frame is arbitrary and so not related to any valid reason for having said time frame. The delivery deadline is generally picked and handed down simply because a senior manager wants to see something by date x and these dates are normally tied to budget cycles rather than to the complexity of the work.

On the flip side the budget is often set at the start of the project and long before the software and hardware requirements have been thought through. This does leads to cases where additional licences are required and yet cannot be purchased because the budget has been set and so workarounds have to be developed and this affects the quality and time side of the project.

Many projects are presented that have the time (deadline) and budget set.
In many the scope as well has been pre-determined, even more often the scope will grow during the project.
Managers/clients will want to shoehorn things into the project that were originally not in the scope.
This is normally seen by management as a way of saving money due to them adding more to the scope but not increasing the time or budget allocated.
Even more often, during the project it is found that a dependency on a system not in scope is affected by a system that is in scope and so suddenly the not-in-scope system has to be added to the scope.

This means that the only thing that we, as project staff, IT admins, consultants, etc have any actual control over is the quality and the only control we have is to reduce the quality of the work in order to meet the demands of the time, budget and scope.

Even more often the time factor becomes such a pressing issue that overtime is demanded/pushed to such an extent that quality is automatically sacrificed on the altar of  'getting it done' and, of course, as I said above, you can't change one thing in isolation without a knock on effect of the other three. The quality decreasing will have an impact on the time as things need to be fixed, the budget as more staff/more overtime is needed and the scope as things are sometimes dropped in order to meet the deadline again.

Absolutely none of this is news and I suspect everyone who reads this and works on some sort of IT project is nodding theirs heads because they have been there yet why do we, as IT professionals keep getting caught in the same trap? It's not even as if anything I've written here is new - Tony Collins wrote a book in 1999 which goes into some detail about several major IT disasters. 1999 was almost 14 years ago so why do we still have the same mistakes?

It is long past time that IT became a traditional engineering discipline and that engineers working on projects spoke up when a scope, schedule or budget was obviously written in fantasy land. Let's start doing projects once and doing them right.