It’s been too long; I’ve been so busy. I guess I’ll recap what I’ve done in these two years plus.
- Fixed over 100 incorrectly configured Linux boxes so that they would actually send their admin the output of the logwatch command.
- Reconfigured the same to use ClamAV correctly and with consistent settings instead of the hodge-podge that they were.
- Built a custom monitor for a series of servers that allowed non-techs to determine if the servers in question were up or down. I’d come back to this.
- Built a scripted installer for a 21-server farm, taking a 10-20 minute process down to a single command line. I’d come back to this too.
- Fixed the log backup system that had been in place for months. It’s still there now, but it needs to change.
- Got really into replacing complex manual functions with Bash scripts.
- Built the data import system for a whole client. SUPER complex and modular, didn’t use most of the code anywhere else save for the functions method.
- Got to know cron really well.
- Got to know ssh -t "command" really well.
- Got lost in the weeds of random apps for random functions; the environment was becoming too large and entrenched to be managed remotely via a central console.
- Built an extensible console for managing the environment in part.
- Built cache-flushing tools.
- Learned how to compile Bash apps at the command line using shc: http://www.datsi.fi.upm.es/~frosal/sources/shc.html
- This gave birth to a number of cool tools; remote fail-over tools that interacted with Cisco devices, for example.
- The Web Console built earlier evolved and got better and better.
- Built automated localized monitors that could restart hung applications before remote monitors could catch the outage.
- Built automated localized monitors that could restart hung applications and NOT cause two systems to restart simultaneously.
- Installed DD-WRT a few times, lots of fun.
- Gave up some weekends
- Gave up some sleep
- Gave up Family Time
- Gave up Long Weekends
- Built a custom log handler for Apache logs that produced a delightful daily CSV from an environment. I imported this into MySQL and created views to work with the data.
- Tried to hit the gym
- Got too busy for the gym.
- Trained up a replacement.
- Left things running okay.
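The localized monitors mentioned in the list above can be sketched in a few lines of shell. This is a rough illustration under stated assumptions, not the original scripts: the function name restart_if_hung, the health check, the restart command, and the lock directory are all hypothetical, and it assumes every system involved can see the same lock directory (for example on shared storage). The idea is that mkdir either creates the directory or fails, so only one monitor can win the right to restart at any moment.

```shell
# Sketch of a local watchdog. All names and paths are hypothetical.
#
# restart_if_hung CHECK_CMD RESTART_CMD LOCK_DIR
#   CHECK_CMD   command that succeeds when the application is healthy
#   RESTART_CMD command that restarts the application
#   LOCK_DIR    directory used as a lock (assumed visible to all monitors)
restart_if_hung() {
    local check_cmd=$1 restart_cmd=$2 lock_dir=$3
    local status=0

    # Healthy: nothing to do.
    if eval "$check_cmd" >/dev/null 2>&1; then
        echo "ok"
        return 0
    fi

    # Another monitor already holds the lock: back off instead of
    # restarting a second system at the same time.
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "skipped"
        return 0
    fi

    # We hold the lock: restart, then release the lock.
    eval "$restart_cmd" >/dev/null 2>&1 || status=$?
    rmdir "$lock_dir"

    if [ "$status" -eq 0 ]; then
        echo "restarted"
    else
        echo "restart-failed"
    fi
    return "$status"
}
```

Run from cron every minute or so, a call might look like restart_if_hung "curl -fsS --max-time 5 http://localhost/" "/etc/init.d/myapp restart" /shared/locks/myapp.restart, where the URL, the init script, and the lock path are placeholders. A stale lock left behind by a crashed monitor would block restarts, so a real version would also want to age out old locks.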
Some Advice for IT Types
Published by NiteMayr on June 24, 2008
If this is the case, why are so many IT jobs filled with people who have no idea what they are doing? I spoke to my share of IT reps from firms all over the Fortune 1000 and Fortune 50 that had no clue what they were doing, nor did they have any idea where they were going with their mandates. Often they had no plan or action plan.
One example really sticks out for me: a hardware changeover plan that had no “buffer”. The IT rep wanted to replace an important firewall with another one. He felt assured that he could just swap the current device for a new and wholly different one, as long as the new device was configured correctly.
This was a bad plan for two reasons:
1) There was no fallback beyond dropping the old hardware in place.
2) The router was the MAIN ingress to their websites and mail systems. There were no external fallbacks or alternate sites for users to visit during the downtime. If the transition went BAD (new hardware fails and old device breaks during transition) there was no fallback.
I know, you’re thinking: Kevin, what would you have done?
I would have published a new set of DNS records with a TTL of about 15 minutes. I would have published them a week before the transition and made sure my DNS server did not sit behind the new router. With those records in place, if something went wrong during the switch you would have about 15 minutes of downtime while you moved your website to a new host. That’s fairly easy to deal with.
I like the idea of planning for downtime like that; you could even change the TTL on the DNS records back to 24 hours when you are done.
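Before the switch it is worth confirming that the short TTL is what the world actually sees. A minimal sketch, assuming the records are checked with dig +noall +answer, whose second column is the TTL in seconds; the function names max_ttl and ttl_ready, the host www.example.com, and the 900-second limit (15 minutes) are illustrative assumptions.

```shell
# max_ttl: read "dig +noall +answer" style lines on stdin and print the
# largest TTL (second column) found, or 0 if no record lines were seen.
max_ttl() {
    awk 'NF >= 5 && $2 ~ /^[0-9]+$/ { if ($2 + 0 > max) max = $2 + 0 }
         END { print max + 0 }'
}

# ttl_ready LIMIT_SECONDS: read the same input on stdin and succeed only
# if at least one record was seen and no TTL exceeds the limit.
ttl_ready() {
    local limit=$1
    local ttl
    ttl=$(max_ttl)
    [ "$ttl" -gt 0 ] && [ "$ttl" -le "$limit" ]
}
```

A check could then be: dig +noall +answer www.example.com A | ttl_ready 900 && echo "TTL is low enough to cut over". Caching resolvers report the time remaining on their cached copy, so querying the authoritative server directly (dig @ns1.example.com ..., a made-up name) shows the published value. Caches that picked up the old record may hold it for the full old TTL, which is why the short TTL has to go out well ahead of the change.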
Here are some tips for outage planning:
If it is an internet-enabled service that users need access to, publish DNS records that point to a “Server is down” page on the net (for web services) when the primary record(s) is/are down.
Keep offsite hard copies (by hard copies I mean stored on hard disk or tape).
Keep enough cash in the IT budget to buy server time on multiple hosts should short-term downtime become extended downtime.
Any server that is important enough to serve all your needs should have a clone on hand with all the same data, backed up every 6 to 12 hours (or more often), so that if your primary server(s) go down a clone can go online in seconds.
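One common way to keep such a clone fresh is rsync driven by cron. A minimal sketch, assuming rsync over ssh with key-based login already set up; the function name sync_clone, the host clone.example.com, the paths, and the script location in the cron line are all hypothetical.

```shell
# sync_clone SRC_DIR CLONE_HOST DEST_DIR
# Mirror SRC_DIR to DEST_DIR on CLONE_HOST. -a preserves permissions,
# ownership, and timestamps; --delete removes files on the clone that no
# longer exist on the primary. Trailing slashes are normalised so the
# contents of SRC_DIR land directly in DEST_DIR.
# Set RSYNC=echo to print the arguments instead of copying anything.
sync_clone() {
    local src=$1 host=$2 dest=$3
    "${RSYNC:-rsync}" -a --delete "${src%/}/" "${host}:${dest%/}/"
}

# Example crontab entry (every 6 hours, on the hour), assuming this
# function is wrapped in a script at a hypothetical path:
#   0 */6 * * * /usr/local/bin/sync_clone.sh
```

Calling RSYNC=echo sync_clone /var/www clone.example.com /var/www previews what would be run. A plain file mirror like this does not give a consistent copy of a live database, so data of that kind would need its own dump or replication step.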
After all, you are the heart of the business when you are in IT, right?