Good sysadmins use automation and monitoring, but they don't rely on them alone. The problem with automated monitoring, as we know from the movie Jurassic Park, is that you only find what you're looking for. Where are those pesky raptors, anyway?
The other night I was looking at the process list on a server and saw there were several copies of the nightly backup script running. I found that the script was trying to upload the backup tarball to a remote NAS, and as part of this process, it tried to delete the previous week's tarball. This file didn't exist, so the command was failing, but the script wasn't smart enough to handle this, so it just kept retrying. As a result, the backup never got properly copied offsite, and the machine was starting to become overloaded with backup jobs.
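Two small habits would have defused this particular failure: treat "the file is already gone" as success, and refuse to start if another copy of the job is still running. Here is a minimal Python sketch of both, not the actual backup script. It assumes the NAS is mounted as a local path and that the job runs on a Unix-like system; the function names and the lock-file idea are my own illustration.

```python
import fcntl
import os


def remove_if_present(path):
    """Delete path. A file that is already gone counts as success.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # Nothing to delete is not an error worth retrying.
        return False


def acquire_single_instance_lock(lock_path):
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file object (keep it open for the life of the job),
    or None if another copy of the job already holds the lock.
    """
    handle = open(lock_path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

A job using this would call `acquire_single_instance_lock` first and exit with a logged error if it returns `None`, so a stuck run produces one complaint per night instead of a pile of competing processes.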
Why is this important? Many monitoring systems would not pick up this condition. Would yours? You may want to improve your monitoring to detect at least the following:
- Check that the job actually ran by looking at the modified date on the job's logfile.
- Check that the job terminated (look for a completion message in its log) and check its exit status.
- If the job creates a file, check for the existence and the size of this file.
- If something gets copied offsite, check the offsite copy and make sure it exists and matches the source file.
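The checks above can be sketched as a handful of small Python functions that a monitoring plugin might call. This is only an illustration under some assumptions: the job writes its completion message (ideally including its exit status) as the last line of its log, and the offsite copy is readable as a file path from the machine doing the check. The function names, thresholds, and the marker text are all hypothetical.

```python
import hashlib
import os
import time


def log_is_fresh(log_path, max_age_seconds, now=None):
    """True if the logfile was modified within the last max_age_seconds."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(log_path)) <= max_age_seconds


def log_ends_with_marker(log_path, marker):
    """True if the last non-empty line of the log contains marker.

    Only the last line is checked so that a completion message left over
    from an earlier run does not count. Reads the whole file, which is
    fine for a sketch but wasteful for very large logs.
    """
    with open(log_path, "r", errors="replace") as f:
        lines = [line.strip() for line in f if line.strip()]
    return bool(lines) and marker in lines[-1]


def artifact_is_plausible(path, min_bytes):
    """True if path is a regular file of at least min_bytes."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path, offsite_path):
    """True if both files exist and have identical contents."""
    if not (os.path.isfile(source_path) and os.path.isfile(offsite_path)):
        return False
    return sha256_of(source_path) == sha256_of(offsite_path)
```

For example, a nightly job could be considered healthy only if `log_is_fresh(log, 26 * 3600)`, `log_ends_with_marker(log, "backup finished rc=0")`, `artifact_is_plausible(tarball, 1024)`, and `copies_match(tarball, offsite_tarball)` are all true. Hashing both copies is the expensive part; comparing sizes first is a reasonable shortcut if the offsite link is slow.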
You get the idea. No automated monitoring system is perfect. A great way to improve it is to look at your servers with an open mind, and when you find problems you didn't expect, add monitoring checks to catch them. Jordan's article on Cron Practices is a good place to start improving the reliability of your cron jobs.
I found several other little gotchas in the backup script which needed fixing. It's easy to assume that a script that's been in production for years is free of bugs. That's almost certainly not the case, and many of those bugs won't be found until you go through the script with a critical eye. If you use version control and test-driven development for your scripts, as recommended by Hugh Brown in his Development for Sysadmins article, this will be a big help.
A good sysadmin is a good detective. You should make time as part of your job to pick a random machine every so often and examine it forensically, like a crime scene. What's here that shouldn't be? Are things working like they're supposed to? Are there any problems with this machine that have gone unnoticed by the automatic monitoring?
Here are some places to look:
- Logwatch output (ideally, have this mailed to you and read it every day).
- Root's mailbox: errors from cron jobs will come here, as will mail bounces.
- The syslog is a great resource, but often very large and hard to read. Try configuring your syslog.conf to send messages from different subsystems (mail, for example) to different logfiles.
- The process list (top). What's running that you don't expect to see? Is anything using an unusual amount of CPU? Is the machine spending more time in iowait than it should? (That may indicate a disk problem or a degraded RAID array.)
- The crontabs (not every job runs as root, so go looking in /var/spool/cron). What jobs are running? Are they working? Should they be there? Is their output logged and monitored?
- netstat, iostat, and vmstat anomalies.
- Places you (and I) wouldn't think of. Go snooping. See what catches your eye. A detective looks without preconceptions, so she's ready to spot the unexpected.
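Once snooping through the process list turns up something like my pile of backup jobs, it is worth turning that observation into a check. A rough Python sketch, assuming a `ps` that accepts the POSIX options `-A -o args=`; the job name `nightly-backup.sh` used below is a made-up example, not a real path from my server.

```python
import subprocess


def running_command_lines():
    """Return the command line of every process, one string per process."""
    result = subprocess.run(
        ["ps", "-A", "-o", "args="],  # "args=" prints the command with no header
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def overlapping_jobs(command_lines, watched, limit=1):
    """Return {name: count} for each watched name seen more than limit times.

    A process matches a name if the name appears anywhere in its
    command line, so choose names that are specific enough.
    """
    found = {}
    for name in watched:
        count = sum(1 for line in command_lines if name in line)
        if count > limit:
            found[name] = count
    return found
```

Calling `overlapping_jobs(running_command_lines(), ["nightly-backup.sh"])` returns an empty dict on a healthy machine and something like `{"nightly-backup.sh": 3}` when runs have started stacking up. Substring matching is crude and will also count an editor that has the script open, so treat a hit as a prompt to go look, not as proof.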
I hope the above makes you suspicious and paranoid, in a good way - a way that makes you a better sysadmin. Anything that can go wrong will go wrong, and is going wrong somewhere on your network, right now. If you get into the habit of looking for trouble, you're sure to find it! Finding trouble is better than trouble finding you.