Preface to the Tutorials

If you are in a real hurry, jump straight to the three Example Tutorials, but you are encouraged to read the preface below.

The reproducibility of large-scale, complex analyses is one of the paramount problems of bioinformatics. This is a non-trivial engineering problem that must be addressed to perform high-quality research. The System for Quality-Assured Data Analysis (SyQADA), a workflow automation system described here, seeks to address reproducibility while imposing the smallest possible learning curve on the user (Keep It Simple, Stupid). Unlike most other workflow systems, SyQADA has only two dependencies: a Unix operating system providing the bash shell, and a standard installation of Python 3.

A SyQADA workflow is simply a list of task definitions. To create new workflows, a user must write a bash script template that uses a simple syntax for specifying parameters that will be substituted with input and output filenames, sample names, and other values that can vary with each invocation of the script. However, SyQADA comes bundled with common next-generation sequencing (NGS) analysis pipelines including those for sequencing alignment, coverage profiling, variant calling, mutation detection, copy number profiling and variant annotation/reporting.

SyQADA relies on the Unix filesystem to record its progress and allow users to understand that progress easily. This means that simply using tools like ls, cat, and grep can tell you a lot about the execution of a SyQADA workflow. The more comfortable you are with the Unix filesystem the happier you’ll be (this is a fundamental truth of modern computing, irrespective of whether you use SyQADA).

As a matter of course, SyQADA works in a single project directory, expecting a specified sourcedata directory in some other location related to the same project. The way I have been configuring things, the directory structure looks about like this:

MYPROJECT/
         sourcedata/batchA
         working/example/
                        control

The available workflows are found in the SyQADA installation directory under workflows. For instance, the tasks for detecting somatic variation are found in:

workflows/seqdata/somaticvariation/tasks

The templates for detecting somatic variation are found in:

workflows/seqdata/somaticvariation/templates

Protocols are found in the protocols directories of each workflows subdirectory. To select one for a new project, create and cd into a new directory, and execute:

syqada begin

You will see a list of protocols. Selecting one will create a control directory and populate it with the protocol file, a config file containing the terms and paths that need to be defined for the protocol, and a dummy samples file. If you are running a somatic workflow, you’ll also need to create a tumor_normal file; in that case, adding a TissueType column to the samples file can be useful for annotations.

It is sometimes useful (particularly with tumor-normal projects) to set the PROJECT environment variable. This would match the prefix of your protocol file. In bash, that would be:

export PROJECT=MYPROJECT

The samples file should contain the names of your samples. These are usually the names of the directories in which the raw data was delivered (if you got fastq files from the sequencing core) or the filenames of vcf files.

Example Tutorials

If you’re on our team, your environment includes syqada in the PATH. If not, SYQADA is the base of the syqada-2.0-RC2 directory, and you’ll probably want to add $SYQADA/bin to your PATH variable.
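
A minimal sketch of that PATH setup, assuming (hypothetically) that the release was unpacked in your home directory. The location is my guess, not something SyQADA dictates, so point SYQADA at wherever the syqada-2.0-RC2 directory really lives:

```shell
# Assumption: SyQADA was unpacked in $HOME; change this to your real location.
SYQADA="${SYQADA:-$HOME/syqada-2.0-RC2}"
export SYQADA

# Prepend $SYQADA/bin to PATH only if it is not already there.
case ":$PATH:" in
  *":$SYQADA/bin:"*) ;;
  *) PATH="$SYQADA/bin:$PATH" ;;
esac
export PATH
```

Putting these lines in your bash startup file saves retyping them in every new shell.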

The tutorial workflows are found in:

$SYQADA/workflows/tutorial/basic

The fast road here is to create a test directory, cd into it,

>>> mkdir tutor
>>> cd tutor

and then execute

>>>  $SYQADA/bin/syqada tutorial
  ( 0) Example                       Simple example protocol for tutorial
  ( 1) HAPLOHSEQ                     Protocol for real-world tutorial on the use of hapLOHseq
  ( 2) REPLICATION_TEST              Protocol for tutorial on replication
  Select the number between 0 and 2 corresponding to your choice ...

Select the number corresponding to your choice. The simple, stupid example follows. The other two tutorials are described in Real-World Tutorial: hapLOHseq and Replication Tutorial.

A Simple, Stupid Workflow

Now you are ready to run the four steps of a mind-bogglingly pointless workflow that counts the characters in a series of files named for the “sample names” and records them in individual files, tests the lengths of those files to see if they match a given value (8), and then runs a “QC” step (the example is used in the test suite, so it’s useful to include a variety of steps).

At this point, you probably want to look at control/EXAMPLE.reference. In this protocol, you’ll find four tasks:

Protocol Version 1.0
TASKDEF = count-characters workflows/example/control/01.count-characters.task
TASKDEF = demonstrate-failure-handling workflows/example/control/02.demonstrate-failure-handling.task
...
TASKDEF = QC_indexer  workflows/example/control/02.QC-count-outputs.task
...
TASKDEF = report-all workflows/example/control/03.report-all.task
...

Four lousy tasks! Well, it’s a model. The other lines are mostly there so that the test suite can test SyQADA functionality. Although a SyQADA task normally takes its inputs from the output of the preceding task, the added_input lines allow tasks to take additional inputs identified either by the indexing term in the task definition, or by explicit pathname. Note that the first lousy task is so simple it can be expressed as an INLINE template:

template = INLINE wc -c {inputdir}/{sample}.name > {output_prefix}.chars
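
To get a feel for what the substitution does, here is a rough sketch that fills in the three placeholders with sed. This is only an illustration, not how SyQADA does it internally; render_template is a hypothetical helper, and the input directory and output prefix values are made up for the example:

```shell
# Hypothetical helper, not part of SyQADA: fill the three placeholders of the
# INLINE template with concrete values using sed.
# Usage: render_template SAMPLE INPUTDIR OUTPUT_PREFIX
render_template() {
  template='wc -c {inputdir}/{sample}.name > {output_prefix}.chars'
  # '|' is the sed delimiter, so the values may contain '/' but must not
  # contain '|' or '&'.
  printf '%s\n' "$template" | sed \
    -e "s|{sample}|$1|g" \
    -e "s|{inputdir}|$2|g" \
    -e "s|{output_prefix}|$3|g"
}

# Illustrative values only; the real paths are chosen by SyQADA.
render_template rxia sourcedata 01-count-characters/output/rxia
```

The result is an ordinary shell command line, which is the point: a generated script is just the template with the braces filled in.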

An alternative (the original) representation of the same protocol is control/EXAMPLE.protocol, which specifies each task inline, and uses a separate template file for the first lousy task.

Now, go ahead and blast forward to create the structure for running the workflow on your samples by executing:

export PROJECT=Example
syqada auto --protocol EXAMPLE.reference --init

If you leave out the --protocol parameter, essentially the same protocol will run, but the TASKDEFs are specified INLINE instead of in task files.

The command will have created the control structure for the workflow by adding these directories and metadata:

01-count-characters/METADATA
02-demonstrate-failure-handling/METADATA
03-QC_indexer/METADATA
04-report-all/METADATA

Look at one or two of those METADATA files if you wish. Then repeat the previous command, removing the --init.

You should see a bunch o’ output, which should take 20 seconds or so (unless you ran this on the cluster, in which case it might take one or two minutes, or worse, depending on queue load). You should see SyQADA report the submission and successful completion of all jobs for the first step and the start of the second step, culminating in something quite a bit like this:

H00:00:11.645 syqada-1.1-beta: Task 02-demonstrate-failure-handling 10 of 11 required jobs completed. Batch in error
H00:00:11.646 Batch 02-demonstrate-failure-handling result: 1 - Batch in error: stop
H00:00:11.646 Batch problem: Batch in error
Checking control directories... ...........
Checking logs... ......................
********************************************************************************
         rxia:   15 stderr,     1 stdout 02-demonstrate-failure-handling/ERROR/demonstrate-failure-handling-runner-rxia.sh%10648
********************************************************************************
-----------------------------------  stderr  -----------------------------------
--------------------------------------------------------------------------------

This demonstrates a job failure that might occur due to a
...(11)...
syqada batch 02-demonstrate-failure-handling --step repend run
will accomplish the same thing in one step
--------------------------------------------------------------------------------
********************************************************************************

      syqada believes that you're doing the tutorial. Read the stderr message.
H00:00:11.685 Automator terminated upon failure of task 02-demonstrate-failure-handling.

This will indicate, as you can see, that the second step, 02-demonstrate-failure-handling, failed, and it shows the truncated standard error output for the failed job. Note that SyQADA makes an attempt to recognize and provide help with the problem. Before we examine the failure, for practice, run:

syqada manage 01-count-characters

and look at the results. Then:

ls 01-count-characters

just to look around. Poke around there for a while; look at the LOGS and output directories, and at the DONE directory.

If you look into the second directory with ls 02-demonstrate-failure-handling/*, you’ll see ten scripts in the DONE directory and one in the ERROR directory, as well as ten files named LOGS/*.done and one named LOGS/*.failed.

Now let’s debug the error. I generally start by confirming the task status:

syqada manage 02-demonstrate-failure-handling

which produces this:

1.0
Jobs 11, Queues  PENDING 0,  RUNNING 0,  DONE 10,  ERROR 1
               ,           ,  begun 11,  done 10,  failed 1, outputs 10
Batch in error

The first line reports total jobs and contents of each status directory. The second reports the number of files of each suffix in the LOGS directory, and the number of jobs with output.
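
Because all of this state lives in the filesystem, you can approximate the same tally by hand. The following is a sketch under assumptions, not a replacement for syqada manage: task_summary and count_files are hypothetical helpers, and I am assuming that the status directories hold one script per job and that log markers end in .done and .failed:

```shell
# Hypothetical helpers, not SyQADA commands.

# count_files DIR [NAME_PATTERN]: number of regular files under DIR.
count_files() {
  if [ -d "$1" ]; then
    find "$1" -type f -name "${2:-*}" | wc -l | tr -d ' '
  else
    echo 0
  fi
}

# task_summary TASKDIR: one line per status directory, then one per log suffix.
task_summary() {
  for queue in PENDING RUNNING DONE ERROR; do
    printf '%s %s\n' "$queue" "$(count_files "$1/$queue")"
  done
  # Assumption: completion markers in LOGS end in .done and .failed.
  for suffix in done failed; do
    printf '%s %s\n' "$suffix" "$(count_files "$1/LOGS" "*.$suffix")"
  done
}

task_summary 02-demonstrate-failure-handling
```

If the numbers disagree with what syqada manage reports, trust syqada manage; the sketch only shows that nothing is hidden from ls and find.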

Since there’s only one error, it’s easy to find by looking for the LOGS/*.failed file. Start with

ls 02-demonstrate-failure-handling/LOGS/*.failed

to find the guilty party, and then cat the standard error output for that job:

cat 02-demonstrate-failure-handling/LOGS/rxia.err

You’ll see this:

This demonstrates a job failure that might occur due to a
system configuration issue. In this case, the data had
a length of 8 and there was no file called no-length-based-name-bias.

If you execute
 touch no-length-based-name-bias
and then run
 syqada batch 02-demonstrate-failure-handling --step repend
it should report that the job has been moved back to PENDING, so that
it can be run again. Then
 syqada batch 02-demonstrate-failure-handling --step run
will cause it to run without error.
 syqada batch 02-demonstrate-failure-handling --step repend run
will accomplish the same thing in one step

With a workflow running real software, of course, the program won’t tell you what to do next. You’ll have to figure out an error message that can range from “file not found” to an obscure stack dump. With some programs, you’ll discover that standard error output is empty, and you have to look at LOGS/*.out instead. Since figuring out the cause of a failure is the hardest task in computing, SyQADA tries to make it as simple as it can be by standardizing everything. See Troubleshooting Guide for some tips on how to sort out failures.
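
The manual hunt described above can be sketched as a small loop. show_failures is a hypothetical helper, not a SyQADA command, and it assumes that the .failed, .err, and .out files for a job share the same base name in LOGS:

```shell
# Hypothetical helper, not a SyQADA command.
# Usage: show_failures TASKDIR
show_failures() {
  for marker in "$1"/LOGS/*.failed; do
    [ -e "$marker" ] || continue        # the glob matched nothing
    job=$(basename "$marker" .failed)
    echo "== $job =="
    if [ -s "$1/LOGS/$job.err" ]; then
      cat "$1/LOGS/$job.err"            # non-empty standard error
    elif [ -e "$1/LOGS/$job.out" ]; then
      cat "$1/LOGS/$job.out"            # fall back to standard output
    fi
  done
}

show_failures 02-demonstrate-failure-handling
```

The fallback to the .out file covers the programs that leave standard error empty.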

syqada tools –errors

I hope you did examine the LOGS directory by hand just to get the feel of it, but of course, when syqada auto terminated, it reported:

Checking control directories... ...........
Checking logs... ......................
   1 error with  15 lines of stderr output
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Errors %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
********************************************************************************
         rxia:   15 stderr,     1 stdout 02-demonstrate-failure-handling/ERROR/demonstrate-failure-handling-runner-rxia.sh%22389
********************************************************************************
-----------------------------------  stderr  -----------------------------------
--------------------------------------------------------------------------------

This demonstrates a job failure that might occur due to a
...(11)...
syqada batch 02-demonstrate-failure-handling --step repend run
will accomplish the same thing in one step
--------------------------------------------------------------------------------
********************************************************************************

This is basically the equivalent of running

% syqada errors 02-demonstrate-failure-handling 5

The ...(11)... in the output indicates that there are 11 lines of standard error output elided. To get syqada to show you the whole file, re-run so:

% syqada errors 02-demonstrate-failure-handling 15

This asks SyQADA to print up to 15 lines of output for the one error, which is the number of stderr lines SyQADA just reported to you.
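
If the elision marker is unfamiliar, this sketch reproduces the idea with awk: keep a few lines at each end and replace the middle with a count. It is not SyQADA's code, elide_middle is a hypothetical helper, and how SyQADA maps its numeric argument onto the number of lines shown is not something I am claiming here:

```shell
# Hypothetical helper, not SyQADA's own code.
# Usage: elide_middle KEEP < file
elide_middle() {
  awk -v keep="$1" '
    { line[NR] = $0 }                      # buffer every input line
    END {
      if (NR <= 2 * keep) {                # short enough: print everything
        for (i = 1; i <= NR; i++) print line[i]
      } else {
        for (i = 1; i <= keep; i++) print line[i]
        printf "...(%d)...\n", NR - 2 * keep
        for (i = NR - keep + 1; i <= NR; i++) print line[i]
      }
    }'
}

# A 15-line input with 2 lines kept at each end elides 11 lines.
awk 'BEGIN { for (i = 1; i <= 15; i++) print "line " i }' | elide_middle 2
```

With 15 lines and two kept at each end, the marker reads ...(11)..., the same count as in the report above.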

syqada tools is even more useful when there are many errors, because it categorizes them by size and uniqueness, which simplifies deducing what is wrong. See syqada errors for more information.

Now let’s go ahead and do what we were told in the error message:

% touch no-length-based-name-bias
% syqada batch 02-demonstrate-failure-handling --step repend

and the error job(s) will be restored to PENDING. If you want, you can check that by repeating the syqada manage command from above. But let’s go ahead and finish the workflow by resuming syqada auto: simply re-run the same syqada auto command as before.

This will confirm that task 01 is complete, discover and report the incompleteness of 02, and then ask you if you want to resume. When you respond yes, SyQADA will re-submit the one job that was repended, and then, when it succeeds, continue by finishing task 03 and task 04 (syqada auto simply runs syqada manage to determine whether a task is complete).

If you have not done so already, browse the templates directory and compare those templates with the generated scripts (which are now all in the DONE directories of the various steps) to see how a template is converted into a script.

Task 03 demonstrates the use of the QC designation, and the template also shows two ways of specifying the list of files in an input directory.

Task 04 demonstrates the use of the task specification:

jobgeneration = summary

Only a single job is created, which analyzes all the output of the previous task. Because the immediate previous task is prefixed with QC, task 04 looks back beyond it to task 02 for its input.
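
The look-back rule can be sketched as follows. This is my reading of the behavior described above, not SyQADA's implementation: previous_input_task is a hypothetical helper, and it assumes a task counts as QC when the part of its directory name after the numeric prefix starts with QC:

```shell
# Hypothetical helper, not part of SyQADA.
# Usage: previous_input_task TARGET TASK1 TASK2 ...   (tasks in workflow order)
# Prints the nearest task before TARGET whose name is not QC-prefixed.
previous_input_task() {
  target=$1
  shift
  prev=
  for task in "$@"; do
    [ "$task" = "$target" ] && break
    case ${task#*-} in          # strip the numeric prefix, e.g. "03-"
      QC*) ;;                   # QC tasks are skipped as input sources
      *) prev=$task ;;
    esac
  done
  printf '%s\n' "$prev"
}

previous_input_task 04-report-all \
  01-count-characters 02-demonstrate-failure-handling 03-QC_indexer 04-report-all
```

For the tutorial's four tasks this prints 02-demonstrate-failure-handling, matching the behavior described for task 04.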

For a tutorial on real-world bioinformatics using the hapLOHseq allelic imbalance detector developed in our lab, see Real-World Tutorial: hapLOHseq. For a tutorial on the use of replication, see Replication Tutorial.

New workflow

To create a new workflow, you would need to create your own simple task and template definitions for steps in the workflow. Examples of those are provided for this workflow in the example/haplohseq/tasks and example/haplohseq/templates directories.

You need to create:

A protocol file:

Lists sequential steps to be executed by the workflow.

A config file:

Specifies software and data dependencies for the workflows.  These
variables can be referenced in protocol, task and template files so
that workflows can easily be ported to other platforms by simply
modifying config files.

A samples file:

Lists names of samples to process through workflows.  These names
are used as prefixes for intermediate and output files of the
workflow.

And tasks and templates:

These are definition files for steps in the workflow.  Tasks define
resources needed for a step in the workflow to be executed on an LSF
or PBS cluster (or on a local server or desktop).  Tasks also allow
the user to split jobs based on chromosomes.  Template files define
the actual step to be executed.

SyQADA then generates log files:

Log files are generated for each step in the workflow including
logs for console output, errors and a job completion status.

And output files:

Each step of the workflow contains an output directory that contains
the artifacts generated by that step.