Retrying Failed Gearman Jobs

The gearman job queue is great for farming out work. After reading a great post about Poison Jobs, I limited the number of times the gearman daemon will retry a job. This seemed fairly straightforward to me: if a job fails, then the gearman daemon will retry the job the specified number of times. I learned the hard way that it was not that simple. There are specific criteria the gearman daemon follows in order to retry a job.

This all came about when I noticed a particular gearman worker was throwing an uncaught exception under certain conditions. I assumed that an uncaught exception would cause the gearman daemon to retry the job. I found out that not only did gearman not retry the job, the client was receiving a return code of GEARMAN_SUCCESS. In other words, the client had no idea the worker was blowing up.

The GearmanJob class provides some methods to inform the gearman daemon of the result of a job. They are primarily used for synchronous jobs. The sendComplete method will cause the gearman daemon to send a return code of GEARMAN_SUCCESS to the client and can also be used to pass data back to the client. The sendFail method will cause the gearman daemon to send a return code of GEARMAN_WORK_FAIL. This may seem fairly obvious, but it is important to note that calling sendFail will not cause the job to be automatically retried. The client code would have to recognize a return code of GEARMAN_WORK_FAIL and decide whether or not to call the job again.

Then there is the sendException method, which will cause the gearman daemon to send a return code of GEARMAN_WORK_EXCEPTION to the client. Do not make the mistake I did by thinking this will implicitly be called if a worker throws an uncaught exception. The main difference between sendFail and sendException is that sendException accepts a string describing the exception. If you wrap the body of a worker function in a try/catch block, you can catch exceptions and call sendException with the exception error message. The sendFail method does not take any parameters and leaves the client guessing as to why the failure occurred.
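Here is a minimal sketch of that try/catch idea. The wrapper name runReported and the $handler callable are hypothetical names of my own; the only GearmanJob methods used are workload, sendComplete and sendException. Exactly how the client sees the result may vary between pecl/gearman versions, so treat this as an illustration and not a definitive implementation.

```php
<?php
/**
 * Run a job handler and report the outcome through the job object.
 *
 * $job     is expected to be a GearmanJob (left untyped in this sketch).
 * $handler is a hypothetical callable: workload string in, result string out.
 */
function runReported($job, callable $handler)
{
    try {
        $result = $handler($job->workload());
        // Success: the client should see GEARMAN_SUCCESS plus the result data.
        $job->sendComplete($result);
    } catch (Exception $e) {
        // Failure: pass the reason along so the client is not left guessing.
        $job->sendException($e->getMessage());
    }
}

// Hypothetical registration, assuming a gearman server on the default host/port:
// $worker = new GearmanWorker();
// $worker->addServer();
// $worker->addFunction('reverse', function ($job) {
//     runReported($job, 'strrev');
// });
// while ($worker->work());
```

Remember that this only reports the failure to the client. It does not make the gearman daemon retry the job.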

The worker does not know if it was called synchronously or not. If the worker was called synchronously, using any of the aforementioned methods will allow the client to determine the status of the job. The client can then decide whether or not to retry a failed job. If the worker was called asynchronously, sending back the job status falls on deaf ears. Nothing is listening for the job status and the gearman daemon will not log failed jobs.
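For the synchronous case, the client-side decision could look something like the sketch below. The function name submitWithRetry and the $maxAttempts parameter are my own inventions. It assumes a pecl/gearman version that provides GearmanClient::doNormal (older releases called this method do) and it only makes sense for jobs that are safe to run more than once.

```php
<?php
/**
 * Submit a synchronous job and resubmit it when the worker reports a failure.
 *
 * $client is expected to be a GearmanClient (left untyped in this sketch).
 * Returns the job result, or null if every attempt failed.
 */
function submitWithRetry($client, $function, $workload, $maxAttempts = 3)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $client->doNormal($function, $workload);
        $code = $client->returnCode();

        if ($code === GEARMAN_SUCCESS) {
            return $result;
        }
        if ($code !== GEARMAN_WORK_FAIL && $code !== GEARMAN_WORK_EXCEPTION) {
            // Some other problem (for example a connection error): do not loop on it.
            break;
        }
        // The worker reported a failure; fall through and try again.
    }
    return null;
}

// Hypothetical usage, assuming a gearman server on the default host/port:
// $client = new GearmanClient();
// $client->addServer();
// $result = submitWithRetry($client, 'reverse', 'hello');
```

Nothing like this is possible for background jobs, since the client never sees a return code from the worker.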

We still have no idea what criteria must be met in order for the gearman daemon to retry a job. I read some gearman mailing lists and perused the daemon source code and I think I have found a definitive answer. The worker must exit with a non-zero code during a job in order for the gearman daemon to retry the job. The strange thing is that an uncaught exception also causes a PHP script to exit with a code of 255. Explicitly calling exit(255) will force a retry, but an uncaught exception will definitely not force a retry. In fact, an uncaught exception will not even cause a GEARMAN_WORK_FAIL or GEARMAN_WORK_EXCEPTION return code.
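If you want to check the exit code claim for a plain PHP script outside of gearman, a small sketch like this one will do. The helper name exitStatusOf is hypothetical, and it assumes you are running the PHP CLI (so that PHP_BINARY points at the interpreter) on a Unix-like shell.

```php
<?php
/**
 * Run a snippet of PHP code in a child process and return its exit status.
 * The child's output is captured and discarded.
 */
function exitStatusOf($code)
{
    $cmd = escapeshellarg(PHP_BINARY) . ' -r ' . escapeshellarg($code) . ' 2>&1';
    exec($cmd, $output, $status);
    return $status;
}

// Compare an explicit exit with an uncaught exception:
// echo exitStatusOf('exit(255);'), "\n";
// echo exitStatusOf('throw new Exception("boom");'), "\n";
```

Both child processes end with the same status, which is what makes the difference in gearman's behavior so surprising.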

After reviewing the pecl/gearman code, I have found that the pecl/gearman worker code is not checking for an exception before returning GEARMAN_SUCCESS. I have submitted a bug report with a patch to at least return GEARMAN_WORK_FAIL when an exception is encountered instead of GEARMAN_SUCCESS. I do think there is an argument to be made that an uncaught exception should force a retry of the job, but I will leave that discussion for another day.

The best way to force a job retry when a worker hits an otherwise uncaught exception is to catch it and call the exit() function with a non-zero status.

function func($job)
{
    try {
        // work
    } catch (Exception $e) {
        // Log the failure, then exit non-zero so the gearman daemon retries the job.
        syslog(LOG_ERR, (string) $e);
        exit(255);
    }
}

This will cause your worker to stop running, but so will an uncaught exception. Most gearman architectures have a monitor that will restart fallen workers. If you don't, get one and have it send out alerts if any worker exits with a non-zero status.

April 11, 2011