A blind fix that restores a down or busted service to health can be valuable. If there is a known set of conditions that indicates a service or device is unhealthy, and a restart can fix it, why not restart automatically? The restart probably doesn't fix the real problem, but an automated health-repair, done carefully, can also help you debug the root cause.
Restarting a service when it dies unexpectedly seems like a no-brainer, which is why mysql comes with "mysqld_safe" for babysitting mysqld. This script is basically:
while true:
  run mysqld
  if mysqld exited normally:
    exit
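As a concrete sketch, that loop generalizes to babysitting any command. This is not mysqld_safe itself (which also handles logging, crash throttling, and more), and the function name run_supervised is my invention:

```shell
# Minimal babysitter sketch: restart the child whenever it exits abnormally,
# and stop supervising once it exits cleanly.
run_supervised() {
  while true; do
    "$@"
    status=$?
    if [ "$status" -eq 0 ]; then
      # Clean exit: someone asked the service to stop, so stop babysitting.
      return 0
    fi
    echo "child exited with status $status; restarting" >&2
  done
}
```

You'd invoke it as `run_supervised mysqld` (or any other command) from an init script or similar.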
A process (or device) that watches and restarts another process goes by a few names: watchdog, babysitter, etc. There are a handful of free software projects that provide babysitting, including daemontools, mon, and Monit. Monit was the first tool I looked at today, so let's talk about Monit.
Focusing only on the process health check features, Monit seems pretty decent. You can have it monitor things other than processes, and even send you email alerts, but that's not the focus today. Each process in Monit can have multiple health checks that, upon failure, trigger a service restart or other action. Here's an example config with a health check that ensures mysql connections are working and restarts mysqld on failure:
# Check every 5 seconds.
set daemon 5

# monit requires each process have a pidfile and does not create pidfiles
# for you. This means the start script (or mysql itself, here) must maintain
# the pid file.
check process mysqld with pidfile /var/run/mysqld/mysqld.pid
  start "/etc/init.d/mysqld start"
  stop "/etc/init.d/mysqld stop"
  if failed port 3306 protocol mysql then restart

This will cause mysqld to be restarted whenever the check fails, such as when mysql's max connections limit is reached.
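Before putting a config like this live, it's worth sanity-checking it. A quick sketch of the usual workflow, assuming monit is on your PATH and already running as a daemon:

```shell
# Syntax-check the control file without starting anything.
monit -t

# Tell the running monit daemon to re-read its configuration.
monit reload

# Show the current state of monitored services.
monit status
```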
While I consider an automatic quick-fix to be good, this alone isn't good enough. Automatic restarts can hinder your ability to debug, because the restart flushes away the cause of the problem (at least temporarily). A mysql check failed, but what caused it?
To start with, maybe we want to record who was doing what when mysql was having problems. Depending on the state of your database, some of this data may not be available (if mysql is frozen, you probably can't run 'show full processlist'). Here's a short script to do that (we'll call it "get-mysql-debug-data.sh"):
#!/bin/sh
time="$(date +%Y%m%d.%H%M%S)"
[ ! -d /var/log/debug ] && mkdir -p /var/log/debug
exec > "/var/log/debug/mysql.failure.$time"

echo "=> Status"
mysqladmin status
echo
echo "=> Active SQL queries"
mysql -umonitor -e 'show full processlist\G'
echo
echo "=> Hosts connected to mysql"
lsof -i :3306

We'll also need to tell Monit to run this script whenever mysql's check fails.
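One wrinkle: if mysqld is truly frozen, some of these commands can hang rather than fail. A variant that bounds each command with coreutils' timeout may help; this is a sketch focusing on the timeout wrapper (the 5-second limit is an arbitrary assumption, and the redirect into /var/log/debug is elided for brevity):

```shell
#!/bin/sh
# Run each debug command under coreutils' "timeout" so a hung mysqld
# can't stall the collector. Prints a section header, the command's
# output, and a note if the command failed or timed out.
collect() {
  header="$1"
  shift
  echo "=> $header"
  timeout 5 "$@" || echo "(command failed or timed out)"
  echo
}

collect "Status" mysqladmin status
collect "Active SQL queries" mysql -umonitor -e 'show full processlist\G'
collect "Hosts connected to mysql" lsof -i :3306
```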
check process mysqld with pidfile /var/run/mysqld/mysqld.pid
  if failed port 3306 protocol mysql then exec "get-mysql-debug-data.sh"

However, now mysql doesn't get restarted when a health check fails; we only record data. I tried a few permutations to get both the data recorded and mysql restarted, and came up with this as working:
check process mysqld with pidfile /var/run/mysqld/mysqld.pid
  start "/etc/init.d/mysqld start"
  stop "/bin/sh -c '/bin/get-mysql-debug-data.sh ; /etc/init.d/mysqld stop'"
  if failed port 3306 protocol mysql then restart

Now any time mysql is restarted by monit, we'll run the debug data script and then stop mysqld. A better solution is probably to combine the data collection and the stop invocation into a single script and set it as 'stop "myscript.sh"'.
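That combined script might look like this; the filename "mysqld-stop-with-debug.sh" is my invention, and the paths mirror the earlier examples:

```shell
#!/bin/sh
# mysqld-stop-with-debug.sh (hypothetical name): capture debug data first,
# then stop mysqld, so monit's stop action does both in one step.
/bin/get-mysql-debug-data.sh
exec /etc/init.d/mysqld stop
```

With that in place, the monit stanza shrinks back to a plain `stop "/path/to/mysqld-stop-with-debug.sh"` with no embedded shell one-liner.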
If I run monit in the foreground (monit -I), I'll see this when mysql's check fails:
MYSQL: login failed
'mysqld' failed protocol test [MYSQL] at INET[localhost:3306] via TCP
'mysqld' trying to restart
'mysqld' stop: /bin/sh
Stopping MySQL:  [  OK  ]
'mysqld' start: /etc/init.d/mysqld
Starting MySQL:  [  OK  ]
'mysqld' connection succeeded to INET[localhost:3306] via TCP

And in our debug log directory, a new file has been created with our debug output.
This kind of tool isn't a perfect solution, but it can be quite useful. How many times has a coworker accidentally crashed a development service, leaving you to go restart it? Applying the ideas above will keep you from sshing all over the place to restart broken services, and it will automatically record crash and bad-health information for you.