“IT is at the heart of business these days and there are real opportunities now to have a career in IT which will ultimately lead to a position on the board.”
If this is the case, why are so many IT jobs filled with people who have no idea what they are doing? I have spoken to my share of IT reps from firms all over the Fortune 1000 and Fortune 50 who had no clue what they were doing and no idea where they were going with their mandates. Often they had no plan of action at all.
One example really sticks out for me: a hardware changeover plan that had no “buffer.” The IT rep wanted to replace an important firewall with another one. He felt assured that he could just replace the current device with a new and wholly different one, as long as the new device was configured correctly.
This was a bad plan for two reasons:
1) There was no fallback beyond dropping the old hardware back in place.
2) The router was the MAIN ingress to their websites and mail systems. There were no external fallbacks or alternate sites for users to visit during the downtime. If the transition went BADLY (the new hardware fails and the old device breaks during the transition), there was nothing left to fall back on.
I know, you’re thinking: Kevin, what would you have done?
I would have published a new set of DNS records with a TTL of about 15 minutes. I would have published them a week before the transition and made sure my DNS server was not sitting behind the new router. With that in place, if something went wrong during the switch, you would have about 15 minutes of downtime while you pointed the records at a new host for your website. That’s fairly easy to deal with.
I like the idea of planning for downtime like that; you could even change the TTL on the DNS records back to 24 hours when you are done.
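To put some numbers on that timeline, here is a rough sketch in Python. The cutover date, the function name and the dictionary keys are all made up for illustration; the 24 hour and 15 minute TTLs and the one-week lead are the figures from above. It assumes resolvers honor the TTL they are given, which is usually but not always the case.

```python
from datetime import datetime, timedelta

def ttl_plan(cutover, old_ttl_s=86400, new_ttl_s=900, lead=timedelta(days=7)):
    """Work out the key timestamps for lowering a DNS TTL before a cutover.

    All names here are hypothetical; this is a planning aid, not a DNS client.
    """
    publish_at = cutover - lead
    # A resolver that cached the record just before publish_at can keep
    # serving the old TTL for up to old_ttl_s seconds.
    old_ttl_expired_by = publish_at + timedelta(seconds=old_ttl_s)
    if old_ttl_expired_by > cutover:
        raise ValueError("lead time is shorter than the old TTL")
    return {
        "publish_low_ttl_at": publish_at,
        "old_ttl_expired_by": old_ttl_expired_by,
        # If the records are repointed right at cutover, caches should
        # pick up the change within new_ttl_s seconds.
        "repoint_visible_by": cutover + timedelta(seconds=new_ttl_s),
    }

# Hypothetical maintenance window: 02:00 on 15 June 2024.
plan = ttl_plan(datetime(2024, 6, 15, 2, 0))
for label, when in plan.items():
    print(f"{label:22} {when:%Y-%m-%d %H:%M}")
```

A week of lead time is far more than the 24 hours the old TTL strictly needs, which is the point: it leaves slack for resolvers that hold on to records longer than they should.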
Here are some tips for outage planning:
- Have a fallback plan for total failure:
If it is an internet enabled service that users need access to, publish DNS records that point to a “Server is down” page on the net (for web services) when the primary record(s) is/are down.
Keep offsite hard copies of your data (by hard copies I mean stored on hard disk or tape).
Keep enough cash in the IT budget to buy server time on multiple hosts should short-term downtime become extended downtime.
Any server that is important enough to serve all your needs should have a clone on hand with all the same data, backed up every 6 to 12 hours (or less), so that if your primary server(s) go down, a clone can go online in seconds.
- Announce the outage in as many ways as possible. Email is never enough for big outages. Warn users in cloud writing if you think they will read it.
- When the outage is going to take a machine out of service forever, contact any old admins and/or users and determine if they have stored anything important on the box. You never know.
- Treat every outage as a potential crisis and be ready for complaints regardless of success or shortness of time.
- Confirm that all parts and plans are in order before the outage is underway. If at all possible, create a schedule and checklist for the outage, with a series of milestones and ETAs that can be delivered to end users and managers.
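The schedule and checklist from the last tip does not need to be fancy. Below is a minimal sketch in Python that turns a list of steps and estimated durations into ETAs you can send to users and managers. The step names, durations and start time are invented for illustration, so swap in your own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Milestone:
    name: str
    minutes: int  # estimated duration of this step

def schedule(start, milestones):
    """Return (name, eta) pairs; eta is when each step should be finished."""
    etas = []
    t = start
    for m in milestones:
        t += timedelta(minutes=m.minutes)
        etas.append((m.name, t))
    return etas

# Hypothetical checklist for a hardware swap starting at 02:00.
steps = [
    Milestone("Confirm backups and spare parts", 15),
    Milestone("Announce start of outage", 5),
    Milestone("Swap hardware", 30),
    Milestone("Verify web and mail", 20),
    Milestone("Announce all clear", 5),
]
outage_plan = schedule(datetime(2024, 6, 15, 2, 0), steps)
for name, eta in outage_plan:
    print(f"{eta:%H:%M}  {name}")
```

Because each ETA is built on the one before it, a step that runs long pushes everything after it, so padding the estimates a little is probably wise.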
After all, you are the heart of the business when you are in IT, right?