How to Build a Script Template

Script templates are typically built from working invocations: the SyQADA- and configuration-dependent values are extracted and replaced with standard terms enclosed in braces. Here is a simple example. The #!/bin/bash on the first line is optional; it serves as a reminder that the script will eventually be run as a shell script, but it will be supplanted by another #!/bin/bash during job generation:

#!/bin/bash

{bedtools} intersect \
  -wa \
  -header \
  -abam {filename} \
  -b {capturefile} \
      > {output_prefix}.bam

The terms bedtools and capturefile are expected to be defined in the config file, and should be specific as to full path and version. Failure to specify full paths to executables and reference files can jeopardize your ability to reproduce your workflow. The value of capturefile depends on the capture technology used for the sequencing experiment. The terms filename and output_prefix are standard terms produced by the SyQADA job generator based on the project inputs.
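To make the substitution concrete, here is a minimal sketch that fills the four braced terms of the example template using sed. It only illustrates the idea; it is not how SyQADA itself performs the substitution, and the function name render_template and all paths shown are hypothetical.

```shell
#!/bin/bash
# Illustration only: SyQADA fills these terms itself during job generation.
# render_template BEDTOOLS CAPTUREFILE FILENAME OUTPUT_PREFIX
# Reads a template on stdin and replaces the four braced terms.
render_template() {
  sed -e "s|{bedtools}|$1|g" \
      -e "s|{capturefile}|$2|g" \
      -e "s|{filename}|$3|g" \
      -e "s|{output_prefix}|$4|g"
}

# Hypothetical values: the first two would come from the config file,
# the last two from the job generator.
template='{bedtools} intersect -wa -header -abam {filename} -b {capturefile} > {output_prefix}.bam'
printf '%s\n' "$template" |
  render_template /opt/bedtools/bin/bedtools /refs/capture.bed 01-input/S1.bam 02-output/S1
```

The rendered line has every braced term replaced by its value, which is the form a generated job script would execute.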

Special Considerations

Several terms are special to specific kinds of analysis.

The terms groupid, forward, and reverse are only used for the bwa sampe step (task 02) of the Illumina alignment workflow. They are used to resolve the paired read files generated by Illumina sequencing.

The terms tumorname and normalname are used for tumor-normal comparisons such as during somatic variation detection using Mutect. In this case only, the term sample is filled from the name given to the tumor-normal pair in the first column of the tumor-normal file, which has the format:

individual normal_sample tumor_sample

on each line (names are separated by tabs).
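As a rough illustration of how the three columns map onto template terms, the following sketch reads a tab-separated tumor-normal file and prints the values that sample, normalname, and tumorname would take for each pair. The function name list_pairs and the sample names are hypothetical, and SyQADA's own parsing may differ in details such as comment or header handling.

```shell
#!/bin/bash
# list_pairs FILE: print the term values implied by each line of a
# tumor-normal file (individual<TAB>normal_sample<TAB>tumor_sample).
list_pairs() {
  local individual normal tumor
  while IFS=$'\t' read -r individual normal tumor; do
    # Skip blank lines.
    [ -n "$individual" ] || continue
    printf 'sample=%s normalname=%s tumorname=%s\n' "$individual" "$normal" "$tumor"
  done < "$1"
}

# Hypothetical one-line tumor-normal file.
pairs=$(mktemp)
printf 'patient01\tP01_blood\tP01_tumor\n' > "$pairs"
list_pairs "$pairs"
rm -f "$pairs"
```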

INLINE templates

[Effective syqada 2.0-delta] A simple one-line command can be expressed as an INLINE template. For example:

template = INLINE wc -c {inputdir}/{sample}.name > {output_prefix}.chars

This example comes from the basic tutorial protocol named Example.reference.

Error reporting

As an aid to understanding configuration errors, SyQADA parses the template to look for terms within braces that are not defined in the config file, the protocol file, or a task file referenced by the protocol file. If any are found, an error similar to this is reported:

Keys not found in the configured task: ['missing_term1', 'missing_term2']
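The same kind of check can be approximated by hand before running SyQADA. The sketch below lists braced terms in a template that do not appear in a given set of defined names. It is an assumption-laden stand-in, not SyQADA's parser: the function name missing_terms is hypothetical, generated terms must be passed in explicitly, and indexed forms such as {added_input=1} are not recognized.

```shell
#!/bin/bash
# missing_terms TEMPLATE_TEXT DEFINED_NAME...
# Prints each braced term in the template that is not among the defined names.
missing_terms() {
  local template=$1
  shift
  local term name found
  # Extract every {term}, strip the braces, and de-duplicate.
  printf '%s\n' "$template" |
    grep -o '{[A-Za-z_][A-Za-z0-9_]*}' |
    tr -d '{}' |
    sort -u |
    while read -r term; do
      found=no
      for name in "$@"; do
        if [ "$name" = "$term" ]; then found=yes; fi
      done
      if [ "$found" = no ]; then printf '%s\n' "$term"; fi
    done
}

# capturefile is deliberately left out of the defined names here.
template='{bedtools} intersect -abam {filename} -b {capturefile} > {output_prefix}.bam'
missing_terms "$template" bedtools filename output_prefix
```

Here the only term reported is capturefile, which is the one a config file would still need to define.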

Generated Terms

These are the terms that are defined by SyQADA during job generation. They can be used as appropriate for a given script.

check_error
This is used for actions that have multiple command invocations within them. It is replaced by standard shell conditional code that will write a timestamp to the .failed file if the previous step returned a non-zero error code, and proceed if it returned OK.
chromosome
Substituted with the generated chromosome during splitting of a job for locus-based parallelism. See region, which is preferred in most cases.
filename
The generated input filename for each created invocation script. If the invocation script uses the --merge option (or the --summary option, which implicitly does a --merge), then the {filename} term must stand alone with no other braced terms on the same line.
files_per_job
The number of output files that each job should generate. This is normally unnecessary, because SyQADA checks that all successful jobs produce the same number of outputs, but it can be used to validate the number of output files for summary jobs, because a single job has nothing to compare to. It could also be used to allow files_per_job = 0 for those occasional cases where the output name does not match the sample name, but I would think that a QC step would be a better idea.
forward
Forward readname for forward-reverse read matching.
groupid
Group name for forward-reverse read matching.
inputdir

Location of the directory or directories in which to find the input for this step. By default, the inputdir for any task is the output directory of the preceding step. If inputdir is specified in the task description or protocol, the default is overridden with the new value, which may be either an existing directory or a task identifier. A step whose task identifier begins with ‘QC’ is ignored when determining which is the preceding step. (A step whose task identifier begins with ‘QC’ will itself use the preceding step’s output as input, whether or not that preceding step’s identifier begins with ‘QC’.) The job generator then uses the sample name and the namepattern to build an input or a set of inputs using shell globbing syntax. The special specification:

inputdir = sourcedata

can be used to repeat the use of the sourcedata directory for input during a later step.

Additional input directories may be specified in the task description or protocol using added_input, which can be a comma-separated list of existing directories or task identifiers. They will be referenced by index as {added_input=1}, {added_input=2}, etc. {added_input} is the equivalent of {added_input=1}. added_input does not contribute to the definition of filename, so templates must use constructions like {added_input=1}/{sample}.suffix to make reference to specific files. The term sourcedata may be used to specify the sourcedata directory.

iteration

If {iteration} is specified in the template, a number of jobs corresponding to the value of the term iterations will be generated, populating the value of iteration in each job with the corresponding iteration number. To trigger this behavior, the term iterations must appear in the task definition (or in the parameters for that TASKDEF in the protocol). iterations can be either a number or a comma-separated set of numbers, and behaves a bit like the inputs to the python range() function, e.g.: from,to,step, except that . If the value of iterations is instead a string (e.g., override), then the value is taken from the value of iterations in the config file. The claim is that this simplifies the construction of workflows like the SparCC runner, which uses the value of iterations to generate that number of permutations, and then later uses iteration to iterate over those permutations. As to simplicity, your opinion may vary, but it did avoid the duplication of the parameter from the config file into the protocol/task file.
logdir
Location of LOGS directory, used occasionally when a program creates its own log output other than stdout/stderr.
mergefile
Obsolete term formerly used to indicate filling multiple filenames as input, replaced by filename.
normalname
The normal input for a tumor/normal pair defined by a --tumor_normal option. The individual from whom the pair was taken is available as sample. This term supersedes the term normalfile.
output_prefix
Shorthand for {outputdir}/{root}{region}
outputdir
Location of the directory in which to write all the output for this step. The path is created by the job generator.
project
The project name provided as METADATA, most often used during summary tasks.
region
Substituted with the generated locus range during splitting of a job for locus-based parallelism.
reverse
Reverse readname for forward-reverse read matching.
root
The rootname (with trailing dotted suffix removed, equivalent to the shell syntax $blah:r) of an output file, to which an appropriate suffix is usually then added.
sample
The sample name taken from the sample_file, normally PROJECT.samples. In the case of the tumor_normal option, the individual from whom the pair was taken is available as sample.
sample_file
The file, normally PROJECT.samples, containing information about the samples in the project, one sample per line. See The Sample File for details.
tmpdir
Location of the standard temporary directory for the project, often used as a value for java.io.tmpdir, but available for any similar purpose. An attempt is made to clean up its contents at the end of a successful batch.
touch_output
This is used for actions that produce no output, such as the import and annotation steps of vtools. It is replaced by a command that will create a {root}.complete file in the {outputdir}, so that SyQADA can conclude that the step completed successfully. (See Hacks – Egregious)
tumorname
The tumor input for a tumor/normal pair defined by a --tumor_normal option. The individual from whom the pair was taken is available as sample. This term supersedes the term tumorfile.
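
Some of the generated terms above are simple compositions of others. As a hedged illustration, the sketch below mimics how root and output_prefix relate, following the definitions given: root is a name with its trailing dotted suffix removed, and output_prefix is {outputdir}/{root}{region}. The function names, the directory names, and the form of the region value are hypothetical, and deriving root from the basename of an input file is an assumption made here for the example.

```shell
#!/bin/bash
# root_of FILE: basename with the last dotted suffix removed
# (comparable to applying csh's :r modifier to the basename).
root_of() {
  local base=${1##*/}
  printf '%s\n' "${base%.*}"
}

# output_prefix OUTPUTDIR ROOT [REGION]: mirrors {outputdir}/{root}{region}.
# REGION is empty when the job is not split by locus.
output_prefix() {
  printf '%s/%s%s\n' "$1" "$2" "${3:-}"
}

# Hypothetical step directories and region suffix.
root=$(root_of 01-align/S1.sorted.bam)
output_prefix 02-intersect "$root"
output_prefix 02-intersect "$root" .chr1
```

A template would then append its own suffix, as in {output_prefix}.bam in the first example.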