The Bit Bucket

Monday, September 28, 2009

Upcoming Week

I already know it's going to be one of "those" weeks, but then I guess it always is: there is always something that turns a nice, quiet, productive week into a nightmare. This week hasn't even started properly yet and it already has the feel of problems.

Put simply, this week is going to be all about shuffling data from one location to another: the sheer joys of being taken over and having to comply with another company's standards. It doesn't help that our permissions are a total disgrace. If you think of the standard rule of putting users into groups and assigning permissions to the groups... well, we haven't done that. We don't even come close, so permissions changes are proving to be a stumbling block.
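For anyone who hasn't met that rule, a minimal sketch of "users into groups, permissions onto groups" might look like the following. The group names, the share path and the `can_access` and `move_user` helpers are all made up for illustration; none of this is our actual setup.

```python
# Hypothetical example: permissions are granted to groups, never directly to users.
GROUP_MEMBERS = {
    "finance-rw": {"alice", "bob"},
    "finance-ro": {"carol"},
}

# Each share lists the groups allowed to use it and the access level they get.
SHARE_ACL = {
    "/shares/finance": {"finance-rw": "write", "finance-ro": "read"},
}

# Higher number means more access; "write" implies "read".
LEVELS = {"read": 1, "write": 2}


def can_access(user: str, share: str, wanted: str) -> bool:
    """Return True if any group the user belongs to grants at least `wanted`."""
    for group, level in SHARE_ACL.get(share, {}).items():
        if user in GROUP_MEMBERS.get(group, set()) and LEVELS[level] >= LEVELS[wanted]:
            return True
    return False


def move_user(user: str, old_group: str, new_group: str) -> None:
    """Changing someone's access is a group membership change, not an ACL edit."""
    GROUP_MEMBERS[old_group].discard(user)
    GROUP_MEMBERS.setdefault(new_group, set()).add(user)
```

The point of the model is in `move_user`: when data has to be shuffled to a new location, only the handful of group entries on the share need recreating, rather than one entry per person.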

We also have a subsidiary company that has been bought out by another company. Data transfer to the new location is proving to be another interesting experience. I cannot, for the life of me, understand why people cannot get organised and actually work out beforehand what it is they need, and so this is leading to a lot of last minute requests... situation normal I suppose.

Add to this the fact that NetApp have discovered another bug, this time in their firmware, and this bug is a good one. It seems that, only on the 3070 cluster, the bug will result in a reboot of the filer leading to a panic. Nice. I had this happen to one of the filers in Cambridge and now it seems that the only way to fix it is by being onsite with a laptop and a serial cable....

More updates coming soon and I'm thinking of changing this blog into more of a war stories blog with the occasional bit of technical information rather than leaving it a few months between updates.

Sunday, July 05, 2009

Why change control can be a Bad Idea

I'm sure that many people have had to endure the torture that is a change control process. In short, change control is a process whereby changes to a system have to be approved by a change control panel.
Generally, the panel is a group of people who probably don't know about the system and/or don't much care about it, so you have to be quite vocal about why a change might be needed.
The actual idea behind change control is to moderate changes to a system in such a way that, should a system fail or have problems, it should be possible to use the change control tool to work out what recent changes had been applied and undo those changes, or research/test to see if those changes could be the root cause of the issue. Sounds ideal, doesn't it? In reality it never works that way.
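As a rough sketch of the lookup the theory promises (which changes hit this system shortly before it broke?), something like the following would do. The change records, system names and the `recent_changes` helper are invented for illustration and don't describe any particular change control tool.

```python
from datetime import datetime, timedelta

# Hypothetical change records: (when applied, system, summary).
CHANGES = [
    (datetime(2009, 6, 1, 9, 0), "web01", "Update SSL certificate"),
    (datetime(2009, 6, 20, 22, 30), "web01", "Patch web server"),
    (datetime(2009, 6, 21, 8, 15), "db01", "Add index to orders table"),
]


def recent_changes(system, incident_time, window_days=7):
    """Return changes applied to `system` in the window before the incident, newest first."""
    start = incident_time - timedelta(days=window_days)
    hits = [c for c in CHANGES if c[1] == system and start <= c[0] <= incident_time]
    return sorted(hits, key=lambda c: c[0], reverse=True)


# Changes to web01 in the week before a (made-up) incident on 22 June.
suspects = recent_changes("web01", datetime(2009, 6, 22, 10, 0))
for when, system, summary in suspects:
    print(when.isoformat(), system, summary)
```

Of course, the lookup is only as good as the records, which is exactly where the trouble starts.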

In the long run a change control tool can actually do more damage than it's designed to prevent. How come?

Let's say that you have a website which has a fairly minor bug. Let's say that you know that the change control process will take two weeks to follow its winding path and that you will need to invest about four hours to write up and present what is, at best, a 20 minute change.

What do you do?

Do you spend the time and fix the bug, or do you forget about it and press on with the several hundred other things that are on your list?

Let's say you pick the second option (and I don't blame you if you do, because I've done that). Several months later that small issue could explode into a big one.

And that's why change control systems need to be as flexible as possible; otherwise what appear to be minor changes will be quietly shelved, and in the long run that can lead to a major incident.

I'll provide some suitably altered real world examples in a future article.

Wednesday, June 10, 2009

Software is not a panacea - Part 2

In the previous article I raised the fictional scenario of a company wanting to automate a timesheet submission process. In this article I'd like to touch on some of the project processes that would be used by the majority of companies.

Generally, most companies will start off with the sensible process of evaluating existing software packages, looking at what's out there and maybe even seeing what other companies use. After a period of time a sensible company will come to the conclusion that there is no one piece of software that fits their requirements, and so their requirements must change, as well as some processes. This is a key point: every company likes to think that it is unique, and certain processes have grown up around that uniqueness, so when it comes to upgrading or computerising those processes the company is reluctant to change them.

However, back here in the real world most companies will do one of three things. They will:
  1. Abandon the idea
  2. Buy the commercial package closest to their requirements and get it customised
  3. Hire a developer to write a bespoke piece of software
Of the above three options the first is the best and safest, but at this point many companies make another fundamental mistake: they never document the issues found or the reason for the project being abandoned. This means that often someone else will reopen the project 6 to 12 months later, reinvestigate the options and then select option 2 or 3.

Option 2 is an interesting one. Surely there can't be much wrong with making some customisations, can there?
Well, it depends. If the software is designed to allow those customisations then go ahead. However, many companies will want to alter certain business logic (e.g. maybe three people would have to approve a timesheet and the system, by design, only allows a maximum of two).
Quite often a company will purchase development skills and get the codebase changed to support what they require. This causes a problem when upgrades are required or if a security hole is discovered, as often the customised version will break when patches for the mainline system are applied, if it's even possible to apply them at all.
Now the company ends up in a situation where they like and want the features in the next version but are tied to an old version due to the customisations. Often they will have to face the choice of staying with the customised version, migrating to the new version or paying out to get the customisations into the new version.

Option 3 opens up all sorts of interesting possibilities for problems and complications to occur, but I'll save that one for another post.

Friday, August 01, 2008

Why Total Cost of Ownership is a fallacy

If I have one more potential supplier try and sell me something on the lie that it will "reduce TCO" I will not only scream but I will beat them to death with a CAT 5 cable.

Total Cost of Ownership (TCO) is one of those almost unmeasurable values that seems to have pride of place in the salesperson's portfolio. How do they KNOW a new system (with its associated equipment, licensing and training costs) will work out cheaper than the old one?
The idea is that newer systems have better support, so rather than training someone in an older system, and maybe having to buy in more expensive skills for more legacy systems, it works out cheaper to upgrade or replace with the latest model.

I don't disagree that for some systems which are truly legacy, such as the old DOS or OS/2 application, replacement may well work out cheaper in the long run, but the one thing that will truly reduce TCO?

  • Understand your systems.

  • Take time to test and document the fixes.

  • Use your call logging system as a knowledge base.


These three tips alone will truly reduce TCO.

    Sunday, July 13, 2008

    Legacy Systems and a very handy SQL comparison Tool

    On Friday, I had the "pleasure" of having to get a legacy system up and running.
    This system was originally introduced to allow users in the business to manage group membership for projects they had ownership of. The idea was that it would cut down user calls to the service desk by about 10% and allow the project managers themselves to get a speedier turnaround for new starters.
    Sounds fine in theory, and in the world of NT4 it wasn't a problem. Move on to the world of Active Directory and things are a little different. The legacy system (Bindview v4.6) has been superseded about 5 times over but we can't just install the latest version. Trust me on this: the latest version is fine, but there are many design decisions and compromises, as well as several rejections for upgrading the system from a few years back, that have all combined to lead to the current problem.

    The actual problem was an interesting one. The system was complaining whenever anyone tried to edit a group. A restore of the back end SQL database fixed the problem until the next domain sync occurred when the database would corrupt itself again.

    Obviously, the sync was pulling something from the domain that it didn't like.
    For the first attempt at a fix I fired up SQL Trace which records every single SQL statement that goes to a selected database. The neat thing about Trace is that it's possible to point the trace results to a SQL database itself and then filter it to get rid of stuff you know isn't going to help - such as SQL agent tasks and so on.
    Trace left me with a multi-variable SQL script spanning over 4,000 lines that was quite difficult to read or even test. So I decided that the next best thing was to restore the working database to a new database name and then find a tool to compare every object on the Bindview user table, to see what was different between the restore and the one that synced with the domain and promptly broke.

    AdeptSQL was the third tool I tried, and whilst it has a very simplistic point and click interface it's incredibly powerful for comparing two SQL databases. Once the comparison is done you get two side-by-side windows which represent the two databases. Changes are highlighted by colour: red for deletions, blue for new and black for no changes.
    This left me with a list of 2,000 changes, deletions and amendments in the database.
    AdeptSQL also lets you filter things out, and by using these features I eventually tracked the problem down to the description field of two user accounts.
    These accounts had spurious characters in them which Bindview, being rather old and totally ASCII, promptly fell over on. Removing these and waiting for a resync solved the problem.
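With hindsight, a check along these lines would have found the culprits much faster. This is only a minimal Python sketch, assuming the spurious characters were simply anything outside plain ASCII. The account names and descriptions are invented, and `find_non_ascii` is a hypothetical helper, not something Bindview or AdeptSQL provides; in practice the rows would come from a query against the user table.

```python
def find_non_ascii(rows):
    """Yield (account, offending_characters) for descriptions with non-ASCII text.

    `rows` is an iterable of (account_name, description) pairs;
    a description of None is treated as empty.
    """
    for account, description in rows:
        bad = [ch for ch in (description or "") if ord(ch) > 127]
        if bad:
            yield account, bad


# Made-up sample data standing in for the user table.
sample = [
    ("jsmith", "Project manager"),
    ("mmuller", "Contractor \u2013 M\u00fcnchen office"),
    ("svc_backup", None),
]

# Print each flagged account with the code points of the offending characters.
for account, bad in find_non_ascii(sample):
    print(account, [hex(ord(ch)) for ch in bad])
```

Only accounts whose description contains something outside ASCII are reported, which cuts a list of thousands of differences down to the handful worth looking at.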

    Whilst AdeptSQL helped me solve that particular problem, there is still the problem of this legacy system updating Active Directory whilst not being Active Directory aware, which leads to some other fun and games with the display name versus the SAMAccount name, but more on that in a later article.

    Monday, September 24, 2007

    If in doubt, reboot........ the train........

    My journey into work is normally quite uneventful. Since the move out to Kent it generally takes 20 minutes longer but the journey is actually fairly pleasant. Today was the exception.

    About 20 minutes into the journey the train's brakes came on pretty hard, slamming the train to a stop, and we sat there for a couple of minutes before the guard came onto the tannoy to explain that there was a problem with the train's brakes (really?!) and that they were going to try a fix... This is the point that they REBOOTED the train. I kid you not, the annunciator at both ends of the coach went out, the air con died and the lights all went out......... A few minutes in the quiet and everything came back on, but I would have loved to have seen a BIOS start up message scroll across the annunciators!

    As a side note in this case the fix didn't work and the train was taken out of service at Orpington but I swear that's the first time I've been on a train that's needed a reboot!

    Tuesday, August 07, 2007

    Bad Project Management

    Sometimes this industry makes me want to scream. My old favourite, the artificially tight deadline, has been back in force this week, with a project due to finish at the end of the month being shortened to the 23rd and now further shortened to this Friday.
    Obviously, in order to deliver, the project will have to skip most if not all of the testing. Problems will occur in a very user facing environment and there will be no pre-learnt knowledge of failure modes, which in turn means a very steep learning curve.

    Of course, chances are everything will be fine. Chances are testing won't uncover any major problems, chances are the testing can be deleted with no obvious impact to the systems.

    However, without testing it's impossible to know. Without testing, the little oddities that do crop up during the operation of a system can't be found or at least recognised.

    A second really annoying part is that the project management tool we are using requires testing to be added, so all the tests have been carefully thought about and added and now they won't be used.

    This is my boss's boss demanding this, so do I get on with it knowing the system will be inferior, or refuse unless it's properly tested?

    At the end of the day I consider myself part of the engineering community with standards and a pride in my work so I will make a noise but fights like these end up leaving me drained, tired and wondering why the hell I still work in this sector.

    This project doesn't need to deliver early. It just makes the stats look good and I am now fed up of working long days to fix something that should not have been delivered broken in the first place.

    Friday, May 18, 2007

    The week from hell

    Ok, so it's more nine days than a week, but you know how people say trouble comes in threes? Well, I've almost had three lots of three over the past nine days. I'll go into further explanations for some of the more technical ones but so far this is the list of problems from just over a week!

    1. Power supply died in my Freeview box
    2. My DVD player died (Ok, it's been on the way out for a while)
    3. Weekend works did not go as planned due to an oversight
    4. I found two bugs in Data ONTAP (NetApp's proprietary operating system). One caused a filer panic
    5. My domain controller at home died and the backup image for it is causing me problems
    6. An upgrade of a server caused a database to go bad. That took three days to get back and it's still playing up a bit
    7. I got run over by a cyclist, one of a number who are (according to the police) running people over in order to snatch phones
    8. Work laptop

    FreeView Box
    I have a Digifusion FVRT150 Freeview box. This has an internal 80GB hard disk which is great for recording programmes, and it worked flawlessly up until a week ago, when all sorts of odd noises started coming from it, caused by the unit not having enough power to work. Apparently it's a common problem and there is a site called XtendedPlay that sells replacement power supplies. I bought one and it's all working perfectly now.

    DVD Player
    After having a lot of issues with videos I bought a combination DVD & Video player a couple of years back. The DVD player on this works but the mechanical rollers which allow the tray to eject have died so I bought a new slimline dedicated DVD player from Amazon. The prices on these have REALLY come down as it cost just £17.

    Weekend Works
    The company I work for planned a maintenance weekend to upgrade a FAS940 filer to a clustered FAS3070. Unfortunately no one took into account the vFiler which is part of the existing filer.
    For those that don't know NetApp, filers are basically big storage cabinets of disks and they allow some clever tricks such as iSCSI, proper quotas and so on.
    They also allow the creation of vFilers. A vFiler is a virtual filer or a 'filer within a filer' and it's useful for segregating data. Unfortunately vFilers and VIFs (which are multiple interfaces aggregated into one connection) don't work together. The new clustered environment was planned to use nothing but VIFs.
    During the work the onsite engineer from NetApp decided to create a single VIF, that is, a single connection as part of a VIF group, so that when the vFiler was migrated elsewhere the connection it freed up could be added into the VIF group and it should all just work.
    Wrong. It turns out that, due to a probable bug in the operating system, a single interface in a VIF group will not work.

    Second filer bug
    One thing filers do quite well is pretend to be Windows boxes (via CIFS or Samba if you prefer) or pretend to be Linux/Solaris boxes via NFS. Unfortunately there is a bug in the version of the operating system that we run which can, on rare occasions, cause the filer to panic in certain CIFS operations.
    Somehow we triggered that situation and the filer crashed. Fortunately the cluster worked and the second filer head took over the load.

    My domain Controller
    In several articles I've said that I only run one domain controller and back it up roughly once a week, as rebuilding the server is quite easy. Well, it looks like I might have to go back on that, as my domain controller died in the week and I can't access the data in the backup. Fortunately I do have a VMware image of the server which is working, but DNS is broken so recovering the domain controller is proving to be 'fun'..... Obviously, I shall re-evaluate that second DC!

    Database upgrade
    During the aforementioned maintenance weekend the decision was made to install SQL 2000 SP4 onto all the SQL servers and onto all instances on the SQL servers. This went well, but one system we use, Bindview, which is a delegated access tool, took a turn for the worse. It was then we found out that no one knows the password that Bindview uses to talk to SQL. Ok, simple. Change the password and reset it in Bindview. But you can't do that without reinstalling the software, and you can't reinstall the software on our Bindview server because it's REALLY only meant for NT4 and not the Active Directory (but NT4 emulated) environment. One hell of a restore later (server, database) and a clever hack of the hashed password out of sysxlogins and it was fixed, but it was an interesting time.

    Run Over!
    On the way home the other night a cyclist went into the back of me, then after slinging a punch went dashing off. I reported it to the police and was told that it's becoming common. The idea is that by riding into the back of someone they either knock the phone out of the person's hand or knock the person over, and they can then grab the phone and ride off. Right now, I'm sporting a lovely set of cuts down the back of my leg which have all been treated and should heal up quickly.

    Work Laptop
    For some reason the laptop I use at work decided to go slow. Firefox locked the processor at 99%, killing it then locked another process at 99%, and so on until winlogon locked the processor at 99%. Something was obviously interfering and causing problems. I'm now in the process of rebuilding the laptop.

    And the week is not yet over!!

    Friday, March 09, 2007

    A Novel Idea - Let's do what the business wants.

    I sat in a meeting the other day where a whole new project priority scheme was unveiled based around the unique idea 'Deliver what the business needs'.

    The idea was greeted with thunderous silence. I'm sure most of the people in the meeting with me were thinking the same thing: "As an IS department and as a SERVICE COMPONENT of the business-at-large we should be following this model anyway?!".

    Certainly there will be projects that IS needs to concentrate on that the business will not directly use and/or see no value in. Those projects are things like network monitoring and infrastructure upgrades. The business-at-large does not benefit directly from the project, but it benefits from the knock on effect of having the IS department respond to problems reported by a good monitoring system before the business feels the impact, and it gains from the increased speed/benefits of a better infrastructure.

    Ultimately, whatever projects an IS dept runs will need to be justified to the business, sometimes on a case-by-case basis and sometimes IS can lose out - For example, if a web monitoring system is delivered in place of an upgraded payroll system IS can be moaned at for choosing a system that hinders as the priority over a system that will help.

    The solution here is to ensure that ALL projects you are running are fully visible to the business. Let them see what's going on. Let them see WHY the web monitoring system is more important than the new payroll system. SHOW THEM why infrastructure in one area needs to be upgraded to support the new payroll system.

    The more IS communicates with the business and justifies actions the more the business will come to trust the IS dept as a bunch of people who know what they are doing.

    Everything in IS seems to be a juggling act but there should always be room for clear, unambiguous English.

    Friday, February 09, 2007

    The strange case of the disappearing server

    Sometimes you have one of those REALLY odd requests... Today I was asked to fix a problem with a web link. A quick look showed that the DNS entry had been removed but no one could remember when it was done or how far back......

    A quick hunt around the change recording system found no mention of it, so this meant the change occurred OVER 2 years ago.

    Further digging revealed that this particular system hadn't actually been accessed since something like March 2005 and that some managers thought the problem was down to an intranet 'look and feel' change which happened months after the system was last accessed.

    This actually means it was almost TWO YEARS before they raised the call. Since then most of the people that worked on the system have left the company.

    It's a bizarre world in IT sometimes.
