- every <scheduled time>, it runs your command as some user
- any output gets emailed to you
- prevention of overlapping runs of the same job.
- emailed output only on failures.
- all output logged somewhere.
- timeouts for jobs that run too long.
- randomized startup times to avoid resource contention.
For the rest of this article, we'll show various improvements to the following cron job that does a twice-daily backup of mysql:

```
0 0,12 * * * backupmysql.sh
```

The contents of our backupmysql.sh are:

```sh
#!/bin/sh
mysqldump ...
```

For simplicity, we omit the mysqldump arguments. Let's get on to addressing the individual problems.
Overlapping jobs - Locks
Overlapping jobs can be prevented using locking. Last year, we covered lock file practices, which apply directly here: pick a unique lockfile for each cron job and wrap the job with flock(1) (or lockf(1) on FreeBSD).

Let's prevent two backups from running simultaneously, and abort if we can't grab the lock. flock(1) defaults to waiting indefinitely, so we set the wait time to 0 and use "/tmp/cron.backupmysql" as the lockfile:
```sh
#!/bin/sh
lockfile="/tmp/cron.backupmysql"
flock -w 0 "$lockfile" mysqldump ...
```
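If you'd rather keep the locking inside the script itself instead of wrapping the command, flock can also lock a file descriptor you open yourself. A sketch of that pattern (the lockfile name is the same one chosen above; flock's -n flag is equivalent to -w 0):

```sh
#!/bin/sh
# In-script locking sketch: hold the lock on a file descriptor so it
# lasts for the life of the script and is released automatically on exit.
lockfile="/tmp/cron.backupmysql"

exec 9>"$lockfile"          # open (and create, if needed) the lockfile on fd 9
if ! flock -n 9; then       # -n: fail immediately instead of waiting
  echo "another backup is already running; aborting" >&2
  exit 1
fi

# the real job goes here, e.g.:
# mysqldump ...
echo "got the lock; running the job"
```

This avoids quoting headaches when the job itself is a long pipeline, since everything after the flock check runs under the lock.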
Emailed output only on failures

You don't necessarily need an email every time your job runs and succeeds. Personally, I only want to be contacted if there's a failure. In that case, we want to capture output somewhere and only emit it if the command's exit status is nonzero.
```sh
#!/bin/sh
output=$(mktemp)
mysqldump ... > "$output" 2>&1
code=$?

if [ "$code" -ne 0 ] ; then
  echo "mysqldump exited with nonzero status: $code"
  cat "$output"
  rm "$output"
  exit "$code"
fi
rm "$output"
```
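A small refinement worth considering (my suggestion, not from the original script): a trap on EXIT removes the temp file on every exit path, including an unexpected kill, and saves repeating the rm in both branches. A sketch, with a stand-in command where mysqldump would go:

```sh
#!/bin/sh
# Email-on-failure with guaranteed temp file cleanup via trap.
output=$(mktemp)
trap 'rm -f "$output"' EXIT   # runs on any exit, success or failure

# stand-in for: mysqldump ... > "$output" 2>&1
sh -c 'echo "backup ok"' > "$output" 2>&1
code=$?

if [ "$code" -ne 0 ]; then
  echo "mysqldump exited with nonzero status: $code"
  cat "$output"
  exit "$code"
fi
# no rm needed here; the trap cleans up on every exit path
```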
All output should be logged somewhere

Regardless of exit status, I always want the output of the job to be logged so we can audit it later. This is easily done with the logger(1) command.
```sh
#!/bin/sh
# pipe all output to syslog with tag 'backupmysql'
mysqldump ... 2>&1 | logger -t "backupmysql"
```
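One caveat with piping into logger(1): the pipeline's exit status is logger's, not mysqldump's, so a failed dump would look like success to the email-on-error check above. If bash is available, its pipefail option (or the PIPESTATUS array) restores the real status. A sketch, with a failing stand-in command in place of mysqldump:

```sh
#!/bin/bash
set -o pipefail   # pipeline reports the job's failure, not logger's success

# stand-in for: mysqldump ... 2>&1 | logger -t "backupmysql"
sh -c 'echo "pretend backup output"; exit 3' 2>&1 | logger -t "backupmysql"
code=$?

echo "job exited with status $code"
```

With pipefail set, $code reflects the stand-in's nonzero exit even though logger, the last command in the pipeline, succeeded.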
Some jobs need timeouts

Runaway cron jobs are bad. If you use locking as above to prevent overlaps, a stuck or frozen job can prevent any future jobs from running until something causes the stuck or very-long job to die. For this, we need a tool to interrupt execution of a program after a timeout. I don't know of a canonical tool for this, so I wrote one for this article.
You'll need ruby for alarm.rb. We can now apply this to our backup script:
```sh
#!/bin/sh
alarm.rb 28800 mysqldump ...
```
This will abort if the mysqldump runtime exceeds 8 hours (28800 seconds). My alarm.rb will exit nonzero on timeouts, so if we use the email-on-error tip from above, we'll get notified on job timeouts.
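If you'd rather not depend on ruby, GNU coreutils ships a timeout(1) command that does the same job, and the underlying pattern is easy to sketch in plain shell: background the job, arm a watchdog that kills it at the deadline, and report the job's status. A sketch (run_with_timeout is my own illustrative helper, not part of the article's tooling):

```sh
#!/bin/sh
# Minimal timeout sketch: kill the command if it outlives the deadline.
# The packaged equivalent is: timeout 28800 mysqldump ...
run_with_timeout() {
  deadline=$1; shift
  "$@" &                                              # the real job
  jobpid=$!
  ( sleep "$deadline" && kill "$jobpid" 2>/dev/null ) &  # the watchdog
  watchdog=$!
  wait "$jobpid"                                      # job's status, or 128+signal if killed
  status=$?
  kill "$watchdog" 2>/dev/null                        # disarm the watchdog
  return "$status"
}

# demo: a 1-second job under a 5-second timeout finishes normally
run_with_timeout 5 sleep 1
echo "status: $?"
```

A job killed by the watchdog exits with 128+SIGTERM (143), which is nonzero, so the email-on-error tip catches timeouts here too.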
Randomized startup times

If you have lots of hosts all doing backups at the same time, your backup server may get overloaded. You can hand-schedule all your similar jobs so they don't run simultaneously on multiple hosts, or you can take a shortcut and randomize each job's startup time.
To do this in a shell script, you'll need something to generate random numbers. Doing this explicitly requires a shell that can generate them: bash, Solaris ksh, and zsh support the magic variable $RANDOM, which evaluates to a random number between 0 and 32767. You'll also need something to map your random value across your sleep duration; we'll use bc(1) and bash(1) here (even though zsh's $(( )) math operations support floats, bash seems more common).
```sh
#!/bin/bash
maxsleep=3600
sleeptime=$(echo "scale=8; ($RANDOM / 32768) * $maxsleep" | bc | cut -d. -f1)
echo "Sleeping for $sleeptime seconds before starting backupmysql."
sleep $sleeptime
mysqldump ...
```
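If fractional precision doesn't matter (and for startup jitter it doesn't), bash's integer arithmetic can do the same mapping without bc(1) or cut(1). A sketch:

```sh
#!/bin/bash
# Same random-delay idea using only bash integer arithmetic: $RANDOM is
# 0..32767, so the modulo keeps the delay in 0..maxsleep-1. (There's a
# slight bias because 3600 doesn't divide 32768, but for jitter it's harmless.)
maxsleep=3600
sleeptime=$((RANDOM % maxsleep))
echo "Sleeping for $sleeptime seconds before starting backupmysql."
# sleep "$sleeptime"    # commented out so the sketch returns immediately
# mysqldump ...
```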
Now let's combine all of the above into one super script, cronhelper.sh. Doing all of the above cleanly and safely in bash is not the most trivial thing.
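The original cronhelper.sh isn't reproduced here. As a rough sketch of what such a wrapper can look like (written as a bash function for illustration; the JOBNAME, SLEEPYSTART, and TIMEOUT variable names come from the usage examples, while the lockfile path and the use of coreutils timeout(1) in place of alarm.rb are my assumptions):

```sh
#!/bin/bash
# Sketch of a cronhelper-style wrapper (not the original script).
# Options arrive as environment variables:
#   JOBNAME     syslog tag and lockfile name (default: basename of command)
#   SLEEPYSTART max random startup delay in seconds (default: 0, no delay)
#   TIMEOUT     kill the job after this many seconds (default: no timeout)
cronhelper() {
  local jobname="${JOBNAME:-$(basename "$1")}"
  local lockfile="/tmp/cron.${jobname}"      # lockfile path is an assumption
  local output code
  output=$(mktemp)

  # randomized startup delay
  [ "${SLEEPYSTART:-0}" -gt 0 ] && sleep $((RANDOM % SLEEPYSTART))

  # wrap with a timeout if requested (coreutils timeout(1) stands in
  # for the article's alarm.rb)
  local -a cmd=( "$@" )
  [ -n "$TIMEOUT" ] && cmd=( timeout "$TIMEOUT" "${cmd[@]}" )

  # take the lock, run the job, capture all output
  flock -n "$lockfile" "${cmd[@]}" > "$output" 2>&1
  code=$?

  # always log the output to syslog under the job's name
  logger -t "$jobname" -f "$output" 2>/dev/null

  # emit output (and thus trigger cron's email) only on failure
  if [ "$code" -ne 0 ]; then
    echo "Job failed with status $code (command: $*)"
    cat "$output"
  fi
  rm -f "$output"
  return "$code"
}
```

A successful job prints nothing, so cron sends no mail; a failing job's output and status come out on stdout, and everything lands in syslog either way.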
Using cronhelper.sh is simple. It takes options as environment variables. Here's an example:
```
% TIMEOUT=5 JOBNAME=helloworld cronhelper.sh sh -c "echo hello world; sleep 10"
Job failed with status 254 (command: sh -c echo hello world; sleep 10)
hello world
/home/jls/bin/alarm.rb: Execution expired (timeout == 5.0)

# and in /var/log/messages:
Dec 8 02:58:02 snack helloworld: hello world
Dec 8 02:58:07 snack helloworld: /home/jls/bin/alarm.rb: Execution expired (timeout == 5.0)
Dec 8 02:58:07 snack helloworld: Job failed with status 254 (command: sh -c echo hello world; sleep 10)
```
Now armed with cronhelper.sh and alarm.rb, we can modify our cron job. Let us choose an 8 hour timeout and a 1 hour random startup delay:
The new cron entry is now:

```
0 0,12 * * * JOBNAME="backupmysql" SLEEPYSTART=3600 TIMEOUT=28800 cronhelper.sh backupmysql.sh
```

This new job is now:
- logging any output to syslog
- only outputting to stdout when there's been a failure (and thus only emailing us on failures)
- staggering startup across an hour
- aborting after 8 hours if not finished
- locking so overlapping runs are impossible