Troubleshooting Guide

One goal of SyQADA is to run the components of a workflow with the fewest special software tricks possible, so that a person familiar with Unix and the Unix file system can monitor progress and diagnose problems without needing to know about database commands or other special features. Each job that runs is simply a bash script that wraps a desired command or set of commands and puts the standard output and standard error files in a common specific location so that you can determine problems in the way you would if you were running the same job from the command line. My claim is that the organization provided by SyQADA simplifies manual output and error management, so that the utility functions provided by syqada batch and syqada manage are merely syntactic sugar on what you could do yourself.

Throughout the following, substitute the name of your step for the term TASK-DIRECTORY.

You may ignore the next paragraph, because SyQADA now does this for you:

If you encounter an error, the first thing to do is to look in the TASK-DIRECTORY/LOGS directory for a file ending with a *.failed* suffix. Now find the file with the same prefix and a *.err* suffix, and examine it to see what the problem is. (Occasionally an application will put its error messages in standard output instead, so check the *.out* file if the *.err* file is empty.)
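
The manual procedure just described can be sketched as a small shell function. This is only an illustration of the steps in the paragraph, not part of SyQADA; the function name show_failed_logs is hypothetical, and it assumes the *.failed*, *.err*, and *.out* files for a job share one prefix inside TASK-DIRECTORY/LOGS, as described above.

```shell
# show_failed_logs TASK_DIR  (hypothetical helper, not a SyQADA command)
# For each *.failed marker in TASK_DIR/LOGS, print the matching .err file,
# falling back to the .out file when the .err file is empty or missing.
show_failed_logs() {
    task_dir=$1
    for failed in "$task_dir"/LOGS/*.failed; do
        [ -e "$failed" ] || continue      # glob matched nothing: no failures
        prefix=${failed%.failed}          # strip the .failed suffix
        if [ -s "$prefix.err" ]; then
            echo "== $prefix.err"
            cat "$prefix.err"
        elif [ -s "$prefix.out" ]; then
            echo "== $prefix.out (the .err file was empty)"
            cat "$prefix.out"
        else
            echo "== $prefix: no error output found"
        fi
    done
}
```

Invoked as show_failed_logs TASK-DIRECTORY from the working directory, it prints one section per failed job.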

But wait! In addition to the Ginsu knives, SyQADA also provides a utility function to do this and more:

syqada errors TASK-DIRECTORY

will summarize all the errors found and classify them, identifying those with identical error outputs, and show truncated error messages from each class of error. When a task terminates with an error, SyQADA automatically runs syqada errors, so you should see its output at the end of a failed run.

Once you have identified an error message in the .err file, then it is time to apply your Unix problem-solving skills to resolve the problem. The script(s) that generated the error(s) will be found in the TASK-DIRECTORY/ERRORS directory. You can study it (or any other script for that job, because the only things that will differ are the sample and file names) and even re-run it simply by invoking it at the bash prompt from the working directory, because it is an executable.

Among the errors that cause entire batches to fail, the commonest are "executable not found" and "reference file not found", both of which are corrected by editing the .config file, resetting the directory (syqada reset TASK-DIRECTORY), and rerunning syqada auto.

If you determine that the error is in the template or invocation, and need to change the protocol file to address it, then you will need to purge the directory (syqada purge TASK-DIRECTORY) before rerunning syqada auto.

Partial Successes

If your SyQADA cluster batch terminated with mostly successes and a few errors (some of them finished, some of them not), then you can infer that there is a data-dependent or transient environmental problem. On a cluster, the jobs processing the largest data files may be killed if they require more walltime than given in jobestimate (see Transient Failures, below, for details), or a job may simply have been assigned to a shared node with a CPU- or memory-hog process.

You can grep ERROR in job-statistics for the failed task to see if all the failures have the same wall time. Also, a job killed by the operating system has a specific signature in its error output. These are both errors that SyQADA may someday identify automatically and re-run, because they have well-defined failure modes; that’s feature request #289, but will not be implemented until the use-case presents itself with enough frequency to justify the work.

If jobestimate is the culprit, you must edit the walltime (and possibly the queue) directly in the failed jobs and then repend and run the step as follows:

>>> syqada batch TASK-DIRECTORY --step repend run

This will move failed jobs back to the TASK-DIRECTORY/PENDING directory and resubmit them. The batch repend step deletes any backup files created by your editing, and moves the old log files to an archive directory.

If you are running locally, and you see mixed successes and failures, you might suspect that too-small processor or gb_memory specifications are at fault. These are used to estimate the number of sub-processes to spawn, and can cause memory-hog computations (Java being a prime example – the GATK toolkit is particularly susceptible to this) to fail memory allocation. You can address this either by correcting the specifications or by running:

>>> syqada batch TASK-DIRECTORY --step repend run --max_bolus N

where N is the maximum number of jobs to run in parallel.

Java Version Problems

On the Shark cluster, using Java6 to run a Java program written for Java7 or higher will cause a deceptive error, “Not enough memory.” Because this is sufficiently arcane that it is hard to recognize, SyQADA recognizes this error and advises you that it is a version issue.

Problems Generating Filenames

Batches that fail to initialize often do so because the combination of inputdir, sample, and namepattern used to create input filenames does not produce a valid filename. This should produce a reasonably useful message that says:

No files found matching pattern TASK-DIRECTORY/output/M-DE-2\*.this-suffix-is-invalid

In this case:

TASK-DIRECTORY/output/M-DE-2/\*.this-suffix-is-invalid

*inputdir* = TASK-DIRECTORY/output
*sample* = M-DE-2
*namepattern* = \*.this-suffix-is-invalid

One of the strings (probably the namepattern, as in this case) is wrong. You may need to alter it in the METADATA or protocol to correct the problem.

A fiendish error that can occur if namepatterns are not defined carefully is that one sample name might be a prefix of another, or, when --split is used, that chr1 is a prefix of chr10, etc. Thus, a namepattern with a leading wildcard must be used with care. Sometimes this can be addressed by appropriate filename suffixes in the previous step.
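
The prefix problem can be reproduced with ordinary shell globbing. The sketch below uses made-up file names in a scratch directory and assumes that the sample name is followed directly by the namepattern when the filename pattern is assembled; the helper count_matches is hypothetical.

```shell
# Scratch directory with hypothetical per-chromosome files.
demo=$(mktemp -d)
touch "$demo/chr1.vcf" "$demo/chr10.vcf" "$demo/chr2.vcf"

# count_matches FILE...: count how many of the expanded names exist.
count_matches() {
    n=0
    for f in "$@"; do
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

# "chr1" + "*.vcf": the leading wildcard also swallows the "0" of chr10.
loose=$(count_matches "$demo"/chr1*.vcf)
# "chr1" + ".vcf": the literal dot ends the match at chr1.
tight=$(count_matches "$demo"/chr1.vcf)

echo "leading wildcard matches $loose files"   # 2: chr1.vcf and chr10.vcf
echo "literal suffix matches $tight file"      # 1: chr1.vcf only
rm -r "$demo"
```

A pattern that begins with a literal separator character, where the file naming allows it, avoids the accidental match.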

Because version 0.9.7 and before required batchrunner scripts, which were parsed by the shell, those scripts had to wrap namepatterns in quotes to pass them through the shell to SyQADA (which then passes them back to the shell after it assembles the inputdir, sample, and namepattern!). Consequently, some task files in the workflows directory inherited the quotes from their batchrunner ancestors. An attempt has been made to eradicate them, but some may still lurk. The moral in any case is that namepatterns specified in task files do not customarily require, and react badly to, quoting.

Simple Execution Problems

SyQADA now refuses to submit a second bolus of jobs until at least one member of the first batch of jobs completes successfully, because the commonest causes of universal job failure are configuration issues.

First and foremost is command not found, which indicates a problem in the config file with the specification of the desired executable. Less frequently, on the cluster, it can mean that a module load xxx command needs to be run.

A subtler module load error is caused by specifying a command such as bedtools that depends on a particular version of a shared library (libstdc++ in the example that follows). For instance, you may see a message like:

/scratch/rists/hpcapps/x86_64/bedtools-2.16.2/bin/bedtools: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.11' not found (required by /scratch/rists/hpcapps/x86_64/bedtools-2.16.2/bin/bedtools)

This can be corrected with:

module load bedtools

Another common file-not-found error is caused by incorrect specification of reference files in the config file. Most common reference files can be found somewhere in TEAM_ROOT.

Another file-not-found error is a bit more subtle. One reason for batches that fail to run is a script template that is missing a trailing Unix backslash (\) at the end of a line (it must be the last character of the line, so that the newline gets treated like a space). This causes the shell to think there are two commands to be run where there should have been only one. There are many variants of these errors, of course, depending on where the missing backslash was. They will usually include a red-herring error that may produce long, ugly output in the .err file, because the job did not have all the parameters required to run it correctly. However, the last line of the error will generally be a command-not-found error. One example is:

131-build-annotated-csv/PENDING/TASK-DIRECTORY-runner-Myers_Tongue.sh: line 19: --output: command not found

This can, of course, be fixed by restoring the backslash in the right place.
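
The effect of a missing continuation backslash can be reproduced outside SyQADA with a toy command. The sketch below is not a SyQADA template; it uses echo in place of a real tool and runs the same two-line command with and without the trailing backslash.

```shell
# Intact: the trailing backslash joins both lines into one echo command.
good=$(sh -c 'echo one \
    --output two' 2>&1)

# Broken: without the backslash, the shell runs "--output two" as a
# separate command, which does not exist.
bad_status=0
bad=$(sh -c 'echo one
    --output two' 2>&1) || bad_status=$?

printf 'intact: %s\n' "$good"
printf 'broken (exit status %s): %s\n' "$bad_status" "$bad"
```

The broken variant still runs its first line, which is why the real error output often begins with a misleading complaint from the tool about missing parameters before the shell reports that the orphaned option is not a command.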

Transient Failures

When the JobGenerator creates a new job, that job script includes the basic mechanism for determining whether a job has succeeded. First, of course, it honors the Unix custom that non-zero return status indicates failure. Second, when it starts up, it first places a timestamp (as well as its operating environment) in a file named:

LOGS/{root}-{region}.begun

Upon completion, it records a timestamp in one of:

LOGS/{root}-{region}.done

or:

LOGS/{root}-{region}.failed
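
The marker-file convention described above can be sketched as a small wrapper function. This is a minimal illustration under stated assumptions, not the script that the JobGenerator actually emits: the function name run_with_markers is hypothetical, and placing the .out and .err files next to the markers is an assumption.

```shell
# run_with_markers LOGDIR PREFIX COMMAND [ARGS...]  (hypothetical sketch)
# Records PREFIX.begun before running COMMAND, then PREFIX.done on a zero
# exit status or PREFIX.failed on a non-zero one.
run_with_markers() {
    logdir=$1
    prefix=$2
    shift 2
    mkdir -p "$logdir"
    { date; env; } > "$logdir/$prefix.begun"   # timestamp plus environment
    if "$@" > "$logdir/$prefix.out" 2> "$logdir/$prefix.err"; then
        date > "$logdir/$prefix.done"
    else
        status=$?                              # exit status of COMMAND
        date > "$logdir/$prefix.failed"
        return "$status"
    fi
}
```

A job that is killed outright never reaches either branch, which is why it leaves a .begun file with no .done or .failed companion.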

If neither the .done nor the .failed file exists, then the job has not completed. If it does not still appear in the output of the cluster’s qstat command, then it was probably killed for some reason such as exceeding walltime, or perhaps running with the wrong crowd (if your job ends up on the same node with one that allocates more memory than it requested, it can die or be killed). When a job is killed by the sysadmins (or via qdel), it dies without producing a .failed file. In this case, your job will remain in the RUNNING directory, and must be manually moved to the ERROR directory so that it may be repended with syqada batch.
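
Jobs in this limbo state can be found by comparing the marker files. The function below is a hypothetical helper, not a SyQADA command; it assumes only the .begun, .done, and .failed naming described above and reports every job that started but recorded no outcome.

```shell
# list_unfinished TASK_DIR  (hypothetical helper, not a SyQADA command)
# Print the log prefix of every job that has a .begun marker in
# TASK_DIR/LOGS but neither a .done nor a .failed marker.
list_unfinished() {
    task_dir=$1
    for begun in "$task_dir"/LOGS/*.begun; do
        [ -e "$begun" ] || continue       # glob matched nothing
        prefix=${begun%.begun}
        if [ ! -e "$prefix.done" ] && [ ! -e "$prefix.failed" ]; then
            basename "$prefix"
        fi
    done
}
```

Any name it prints that no longer appears in the qstat output is a candidate for the manual move described above; names still listed by qstat are simply jobs that are still running.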

Although SyQADA is capable of detecting at least this sort of transient failure, and could potentially restart the job itself, it has not yet become clear how to do this without risk of constant resubmission of a chronically failing job due to some cryptic error condition; automatic resubmission will only occur when a satisfactory solution presents itself.

The syqada manage command can be used to check the status of your batch at any time. See syqada manage for usage.

Elsewhere in this manual is a Useful Shell Stuff Primer that may be helpful during debugging.

Caveat on editing generated files

Failures caused by transient changes in environmental factors outside SyQADA’s control (overloaded machine, filesystem full) can be corrected simply by running syqada repend TASK-DIRECTORY. When a failure requires altering the config file, or a task’s METADATA file, syqada reset TASK-DIRECTORY is in order. When a failure requires altering the protocol file, syqada purge TASK-DIRECTORY is appropriate.

In general, however, editing any of the files used in the workflow (especially the job scripts, or the METADATA file, which would be re-generated incorrectly after a syqada purge) is a no-no. Tasks that were completed before the edit was performed may be affected by the edit, so that the crown jewel of reproducibility (removing all the task directories and re-running the workflow from scratch, or re-running the same configuration in another directory using a different sample file) would now fail at an earlier step.

Given sufficient freedom, I would incorporate git into SyQADA, committing changes to the fundamental and generated files and recording sha1 hashes of the reference files and executables used, so that SyQADA could detect and complain about re-runs that conflicted with previous invocations.