.. _syqada_tutorial:

Preface to the Tutorials
------------------------

If you are in a real hurry, jump straight to the three :ref:`example_tutorials`, but you are
encouraged to read the preface below.

The reproducibility of large-scale, complex analyses is one of the
paramount problems of bioinformatics.  This is a non-trivial
engineering problem that must be addressed to perform high-quality
research.  The System for Quality-Assured Data Analysis (SyQADA), a
workflow automation system described here, seeks to address
reproducibility while imposing the smallest possible learning curve on
the user (Keep It Simple, Stupid). Unlike most other workflow systems,
SyQADA depends only on a Unix operating system that provides the bash
shell and a standard installation of Python 3.

A SyQADA workflow is simply a list of task definitions.  To create new
workflows, a user must write a bash script template that uses a simple
syntax for specifying parameters that will be substituted with input
and output filenames, sample names, and other values that can vary
with each invocation of the script.  You need not start from scratch,
however: SyQADA comes bundled with common next-generation sequencing
(NGS) analysis pipelines, including those for sequence alignment,
coverage profiling, variant calling, mutation detection, copy number
profiling and variant annotation/reporting.

SyQADA relies on the Unix filesystem to record its progress and allow users
to understand that progress easily. This means that simply using tools like
ls, cat, and grep can tell you a lot about the execution of a SyQADA workflow.
The more comfortable you are with the Unix filesystem the happier you'll be
(this is a fundamental truth of modern computing, irrespective of whether you use SyQADA).

As a matter of course, SyQADA works in
a single project directory, expecting a specified sourcedata directory
in some other location related to the same project. The way I have been
configuring things, the directory structure looks about like this::

  MYPROJECT/
           sourcedata/batchA
           working/example/
                          control
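
As a minimal sketch, the layout above can be prepared with ``mkdir -p``.
The project root here is a throwaway temporary directory chosen only for
illustration; the ``control`` directory is left out because, as described
below, ``syqada begin`` creates it.

```shell
# Hypothetical project root; substitute a real location of your own.
project="$(mktemp -d)/MYPROJECT"

# Raw data lives under sourcedata, one subdirectory per batch.
mkdir -p "$project/sourcedata/batchA"

# SyQADA is run from inside a working directory for the project.
mkdir -p "$project/working/example"

# Show the directories that now exist, relative to the project root.
(cd "$project" && find . -type d | sort)
```

From ``working/example`` you would then run ``syqada begin`` to have the
control directory created for you.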

The available workflows are found in the SyQADA installation directory
under workflows. For instance, the tasks for detecting somatic variation are found
in::

  workflows/seqdata/somaticvariation/tasks

The templates for detecting somatic variation are found in::

  workflows/seqdata/somaticvariation/templates

Protocols are found in the protocols directories of each workflows subdirectory.
To select one for a new project, create and cd into a new directory, and execute::

  syqada begin

You will see a list of protocols. Selecting one will create a control
directory and populate it with the protocol file, a config file
containing the terms and paths that need to be defined for the
protocol, and a dummy samples file.  If you are running a somatic
workflow, you'll also need to create a tumor_normal file; in that case,
adding a TissueType column to the samples file can be useful for
annotations.

It is sometimes useful (particularly with tumor-normal projects) to set
the PROJECT environment variable. This would match the prefix of your
protocol file.  In bash, that would be::

  export PROJECT=MYPROJECT

The samples file should contain the names of the samples you have, which are usually
the names of the directories in which raw data was delivered, if you got
``fastq`` files from the sequencing core, or the filenames of ``vcf`` or ``CEL`` files.
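
When the sample names are simply the names of the delivery directories,
the list can be collected with a short loop. This is only a sketch: the
sample directories and the ``MYPROJECT.samples`` file name are made up,
and it assumes a plain list with one name per line, so compare the result
with the dummy samples file that ``syqada begin`` generates before
relying on it.

```shell
# Mock delivery: two sample directories under one batch (made-up names).
workdir="$(mktemp -d)"
mkdir -p "$workdir/sourcedata/batchA/sampleA" \
         "$workdir/sourcedata/batchA/sampleB" \
         "$workdir/control"

# Assumption: one sample name per line, no header row.
for dir in "$workdir"/sourcedata/batchA/*/; do
    basename "$dir"
done > "$workdir/control/MYPROJECT.samples"

cat "$workdir/control/MYPROJECT.samples"
```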

.. _example_tutorials:

------------------
Example Tutorials
------------------

Internally, SyQADA sets the environment variable
SYQADA to the |release| directory.
You may wish to set that variable in your own environment, but you will probably at
least want to add |release|/bin to your PATH variable to simplify your
life (if you're on our team, the team bash_profile does so).

The tutorial workflows are found in::

 $SYQADA/workflows/tutorial

The fast road, however, is to create a test directory and cd into it::

  % mkdir tutor
  % cd tutor

and then execute::

  % syqada tutorial
  ( 0) Example              Simple example protocol for tutorial
  ( 1) Features             Protocol for tutorial on special features
  ( 2) HAPLOHSEQ            Protocol for real-world tutorial on the use of hapLOHseq
  Select the number between 0 and 2 corresponding to your choice ...

Select the number corresponding to your choice. The simple, stupid example
follows. The two links here provide
the specific details of the other two tutorials.

* :ref:`syqada_tutorial_haplohseq`
* :ref:`syqada_tutorial_features`

.. _simple_stupid_workflow:

A Simple, Stupid Workflow
-------------------------

Now you are ready to run the four steps of a mind-bogglingly pointless
workflow that counts the characters in a series of files named for the "sample names" and records
them in individual files, tests the lengths of those files to see if they match
a given value (8), runs a "QC" step, and finally runs a single summary job over the results
(the example is used in the test suite, so it's useful to include a variety of steps).

At this point, you probably want to look at control/EXAMPLE.reference.
In this protocol, you'll find four tasks::

  Protocol Version 1.0
  TASKDEF = count-characters workflows/example/control/01.count-characters.task
  TASKDEF = demonstrate-failure-handling workflows/example/control/02.demonstrate-failure-handling.task
  ...
  TASKDEF = QC_indexer  workflows/example/control/02.QC-count-outputs.task
  ...
  TASKDEF = report-all workflows/example/control/03.report-all.task
  ...

Four lousy tasks! Well, it's a model. The other lines are mostly there so that
the test suite can test SyQADA functionality. Although a SyQADA task normally
takes its inputs from the output of the preceding task, the *added_input* lines
allow tasks to take additional inputs identified either by the indexing term
in the task definition, or by explicit pathname. Note that the first lousy task
is so simple it can be expressed as an INLINE template::

  template = INLINE wc -c {inputdir}/{sample}.name > {output_prefix}.chars
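
To get a feel for what the substitution does, here is a rough imitation
using ``sed``. It is not how SyQADA itself performs the substitution, and
the three values are hypothetical; SyQADA derives the real ones from the
protocol, the samples file, and the task directory.

```shell
# The INLINE template from the protocol, with its three placeholders.
template='wc -c {inputdir}/{sample}.name > {output_prefix}.chars'

# Hypothetical values for a single job.
inputdir='sourcedata'
sample='rxia'
output_prefix='01-count-characters/output/rxia'

# Replace each {placeholder} with its value. The sed delimiter is '|'
# because the values contain slashes.
script=$(printf '%s\n' "$template" \
    | sed -e "s|{inputdir}|$inputdir|g" \
          -e "s|{sample}|$sample|g" \
          -e "s|{output_prefix}|$output_prefix|g")

printf '%s\n' "$script"
```

The result is an ordinary shell command with one concrete input file and
one concrete output file, which is what ends up in a generated job script.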

An alternative (the original) representation of the
same protocol is control/EXAMPLE.protocol, which specifies each task inline,
and uses a separate template file for the first task.

Now, blast ahead and create the structure for the first task of
the workflow by executing::

  syqada auto --init

The command will have created the first task directory and metadata::

  01-count-characters/METADATA

Look at this METADATA file if you wish. It defines the necessary parameters for syqada to create and execute the jobs associated with this first step.

Then repeat the
previous command, removing the --init::

  syqada auto

You should see a bunch o' output, which should take 20 seconds or so to finish.
SyQADA will report the submission
and successful completion of all jobs for the first step, as well as
the start of the second step, culminating in something quite a bit like this::

  H00:00:11.645 syqada-1.1-beta: Task 02-demonstrate-failure-handling 10 of 11 required jobs completed. Batch in error
  H00:00:11.646 Batch 02-demonstrate-failure-handling result: 1 - Batch in error: stop
  H00:00:11.646 Batch problem: Batch in error
  Checking control directories... ...........
  Checking logs... ......................
  ********************************************************************************
           rxia:   15 stderr,     1 stdout 02-demonstrate-failure-handling/ERROR/demonstrate-failure-handling-runner-rxia.sh%10648
  ********************************************************************************
  -----------------------------------  stderr  -----------------------------------
  --------------------------------------------------------------------------------
  
  This demonstrates a job failure that might occur due to a
  ...(11)...
  syqada batch 02-demonstrate-failure-handling --step repend run
  will accomplish the same thing in one step
  --------------------------------------------------------------------------------
  ********************************************************************************

  /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
  ** syqada believes this is an intentional error for the tutorial.
  ** For more information, run:
  **     syqada errors --help EXAMPLE_MESSAGE
  \/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/

  Automator terminated task 02-demonstrate-failure-handling with an error.

As you can see, this indicates that the second step, 02-demonstrate-failure-handling, failed,
and it shows the truncated standard error output for the failed job. Note that *syqada errors*
makes an attempt to recognize and provide help with the problem.
In this case, SyQADA is able to guess the nature of the error, so it
tells us that we can get more information by running::

  syqada errors --help EXAMPLE_MESSAGE

Doing so provides a more detailed description of the problem.

Before we start diagnosing the failure, for practice, run::

  syqada manage 01-count-characters

and look at the results. Then::

  ls 01-count-characters

just to look around. Poke around there for a while; look at the LOGS and output
directories, and at the DONE directory.

Now, if you look into the second task directory, you'll see ten scripts
in the DONE directory and one in the ERROR directory, as well as ten files
named LOGS/\*.done and one named LOGS/rxia.failed.

So let's debug the error (SyQADA told us, but one is not always so lucky).
I generally start by confirming the task status::

  syqada manage 02-demonstrate-failure-handling

which produces this::

 1.0
 Jobs 11, Queues  PENDING 0,  RUNNING 0,  DONE 10,  ERROR 1
                ,           ,  begun 11,  done 10,  failed 1, outputs 10
 Batch in error

The line beginning with Jobs reports the total number of jobs and the
contents of each status directory.
The line below it reports the number of files of each suffix in the LOGS directory,
and the number of jobs with output.
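
Because these numbers are just file counts, you can reproduce most of
them with ``ls``. The sketch below builds a mock task directory rather
than touching a real one; the script and log file names in it are made
up, and only the directory names and the ``.done`` and ``.failed``
suffixes come from the tutorial.

```shell
# Mock task directory standing in for 02-demonstrate-failure-handling.
task="$(mktemp -d)/02-demonstrate-failure-handling"
mkdir -p "$task/PENDING" "$task/RUNNING" "$task/DONE" "$task/ERROR" "$task/LOGS"

# Ten jobs that finished and one that failed (file names are made up).
for i in 1 2 3 4 5 6 7 8 9 10; do
    touch "$task/DONE/job-$i.sh" "$task/LOGS/sample$i.done"
done
touch "$task/ERROR/job-rxia.sh" "$task/LOGS/rxia.failed"

# Count the entries in one status directory.
count_queue() { ls "$task/$1" | wc -l | tr -d ' '; }

# Count the LOGS files that carry a given suffix.
count_logs() { ls "$task/LOGS" | grep -c "\.$1\$"; }

summary="PENDING $(count_queue PENDING), RUNNING $(count_queue RUNNING), DONE $(count_queue DONE), ERROR $(count_queue ERROR); done $(count_logs done), failed $(count_logs failed)"
echo "$summary"
```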

Since there's only one error, it's easy to find by looking for
the .failed file in LOGS. Start with::

  ls 02-demonstrate-failure-handling/LOGS/*.failed

to find the guilty party, and then cat the standard error output for that job::

  cat 02-demonstrate-failure-handling/LOGS/rxia.err 

You'll see this::

 This demonstrates a job failure that might occur due to a
 system configuration issue. In this case, the data had
 a length of 8 and there was no file in the working directory named no-length-based-name-bias.

 If you execute
  touch no-length-based-name-bias
 and then run
  syqada batch 02-demonstrate-failure-handling --step repend
 it should report that the job has been moved back to PENDING, so that
 it can be run again. Then
  syqada batch 02-demonstrate-failure-handling --step run
 will cause it to run without error.
  syqada batch 02-demonstrate-failure-handling --step repend run
 will accomplish the same thing in one step

With a workflow running real software, of course, the program probably won't tell you what to do next
(GATK programs give very helpful messages, but the norm is simply to report the error or, worse, to fail
unceremoniously).
You'll have to interpret an error message that
can range from "file not found" to an obscure stack dump. With some programs, you'll discover that
standard error output is empty, and you have to look at LOGS/\*.out instead. You may also need to examine
the files in the output directory to determine whether the program generated any output at all.
Since figuring out
the cause of a failure is among the hardest tasks in computing, SyQADA tries to make it as simple as it can be
by standardizing the location of outputs, and by parsing the stderr file to detect some
common failures caused by configuration errors and suggest possible causes.
See :ref:`Troubleshooting` for some tips on how to sort out failures.
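
The manual routine described above (find the ``.failed`` files, read the
matching ``.err`` file, and fall back to ``.out`` when standard error is
empty) can be wrapped in a small helper. This is a sketch run against a
mock LOGS directory; ``show_failure_logs`` is a hypothetical name, not a
SyQADA command, and the log text is invented.

```shell
# Print the most useful log for every failed job in a task directory.
show_failure_logs() {
    for failed in "$1"/LOGS/*.failed; do
        [ -e "$failed" ] || continue      # the glob matched nothing
        job="${failed%.failed}"
        if [ -s "$job.err" ]; then
            log="$job.err"                # stderr has content: prefer it
        else
            log="$job.out"                # stderr is empty: fall back to stdout
        fi
        echo "== $(basename "$log")"
        cat "$log"
    done
}

# Mock task: one failed job with empty stderr and invented stdout text.
task="$(mktemp -d)/02-demonstrate-failure-handling"
mkdir -p "$task/LOGS"
touch "$task/LOGS/rxia.failed"
: > "$task/LOGS/rxia.err"
echo "mock message written to standard output" > "$task/LOGS/rxia.out"

show_failure_logs "$task"
```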

Using the syqada errors command
--------------------------------

You might examine the LOGS directory by hand just to get the feel of it, but of course,
when *syqada auto* terminated, it reported::

 Checking control directories... ...........
 Checking logs... ......................
    1 error with  15 lines of stderr output
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Errors %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 ********************************************************************************
          rxia:   15 stderr,     1 stdout 02-demonstrate-failure-handling/ERROR/demonstrate-failure-handling-runner-rxia.sh%22389
 ********************************************************************************
 -----------------------------------  stderr  -----------------------------------
 --------------------------------------------------------------------------------
 
 This demonstrates a job failure that might occur due to a
 ...(11)...
 syqada batch 02-demonstrate-failure-handling --step repend run
 will accomplish the same thing in one step
 --------------------------------------------------------------------------------
 ********************************************************************************

This is basically the equivalent of running::

 % syqada errors 02-demonstrate-failure-handling 5

The ...(11)... in the output indicates that there are 11 lines of standard error
output elided. To get syqada to show
you the whole file, re-run it like so::

 % syqada errors 02-demonstrate-failure-handling 15

This asks SyQADA to print up to 15 lines of a single error.
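
The arithmetic behind the ``...(11)...`` marker can be imitated with
``head`` and ``tail``. The rule used here (keep half of the limit at each
end and elide the middle) is only a guess that happens to match the
output shown above, where a limit of 5 left two lines on either side of
11 elided ones; SyQADA's actual rule may differ, and ``elide`` is a
hypothetical helper.

```shell
# Print a file, eliding the middle when it has more lines than the limit.
elide() {
    limit=$1
    file=$2
    total=$(wc -l < "$file" | tr -d ' ')
    if [ "$total" -le "$limit" ]; then
        cat "$file"                       # short enough: show everything
        return
    fi
    keep=$((limit / 2))                   # lines kept at each end
    head -n "$keep" "$file"
    echo "...($((total - 2 * keep)))..."
    tail -n "$keep" "$file"
}

# A 15-line stand-in for the stderr file.
errfile="$(mktemp)"
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > "$errfile"

elide 5 "$errfile"
```

With a limit of 15 the same helper prints the whole file, mirroring the
second command above.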

*syqada errors* can be even more useful when there are many errors, because
it categorizes them by size and uniqueness, which makes it easier to deduce what is wrong. See
:ref:`syqada_errors` for more information.

Now let's go ahead and do what we were told in the error message::

  % touch no-length-based-name-bias
  % syqada batch 02-demonstrate-failure-handling --step repend

and the error job(s) will be restored to PENDING. If you want, you
can check that by repeating the syqada manage command from above. But let's go ahead and finish the
workflow by resuming syqada auto -- simply re-run the *syqada auto* command same as before.

This will confirm that task 01 is complete, discover and report the incompleteness of 02,
and then ask you if you want to resume. When you respond yes, SyQADA will
re-submit the one job that was repended, and then, when it succeeds,
continue by finishing task 03 and task 04 (syqada auto simply runs syqada manage to determine whether
a task is complete).

If you have not done so already, browse the templates directory and compare those templates
with the generated scripts (which are now all in the DONE directories of the various steps)
to see how a template is converted into a script.

Task 03 demonstrates the use of the QC designation, and the template also shows
two ways of specifying the list of files in an input directory.

Task 04 demonstrates the use of the task specification::

  jobgeneration = summary

Only a single job is created, which analyzes all the output of the previous task.
Because the immediate previous task is prefixed with QC, task 04 looks back beyond
it to task 02 for its input.
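
The look-back rule can be pictured as a walk through the ordered task
list that remembers the most recent task whose name does not start with
QC. The sketch below only illustrates that rule with the task names from
the protocol; ``input_task_for`` is a hypothetical helper, and SyQADA's
own logic may take more into account.

```shell
# Task names in protocol order (taken from the TASKDEF lines).
tasks="count-characters demonstrate-failure-handling QC_indexer report-all"

# Print the task whose output feeds the named task.
input_task_for() {
    target=$1
    previous=""
    for name in $tasks; do
        if [ "$name" = "$target" ]; then
            echo "$previous"
            return
        fi
        case "$name" in
            QC*) ;;                       # QC tasks are skipped as input sources
            *)   previous=$name ;;
        esac
    done
}

input_task_for report-all
```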

In addition, Task 04 illustrates how the *added_input* parameter is populated by the template.
The protocol defines two values for added_input, separated by a comma; examine the output file
to see the result.


For a tutorial on real-world bioinformatics using the hapLOHseq allelic
imbalance detector developed in our lab, see :ref:`syqada_tutorial_hapLOHseq`.
For a tutorial on the use of replication, see :ref:`syqada_tutorial_features`.

Creating a New Workflow
-----------------------

To create a new workflow, you would need to create your own
simple task and template definitions for steps in the workflow.
Examples of those are provided for this workflow in the
example/haplohseq/tasks and example/haplohseq/templates directories.

You need to create:

* A protocol file, which lists the sequential steps to be executed by the
  workflow.
* A config file, which specifies software and data dependencies for the
  workflows.  These variables can be referenced in protocol, task and
  template files so that workflows can easily be ported to other
  platforms by simply modifying config files.
* A samples file, which lists the names of the samples to process through
  workflows.  These names are used as prefixes for intermediate and
  output files of the workflow.
* Tasks and templates, which are the definition files for steps in the
  workflow.  Tasks define the resources needed for a step in the workflow
  to be executed on an LSF or PBS cluster (or on a local server or
  desktop).  Tasks also allow the user to split jobs based on
  chromosomes.  Template files define the actual step to be executed.

Running the workflow then produces:

* Log files, which are generated for each step in the workflow, including
  logs for console output, errors and a job completion status.
* Output files: each step of the workflow has an output directory that
  contains the artifacts generated by that step.

Running the command::

  syqada begin

will provide a list of existing protocols that you can start a project with.
When you select a number from the list, that protocol will be used to build a
skeletal control directory with protocol, config, and sample
files you can edit to prepare them to run the protocol.