Task Definitions

A task definition may be placed either in the protocol file itself or in a separate task file referred to in the protocol file. Here is an example of a task definition that works with the script template described next in How to Build a Script Template:

TASKDEF = a-task-name INLINE
template = workflows/seqdata/alignment/templates/filter-exome.template
gb_memory = 4
processors = 1
jobestimate = 60:00
namepattern = .bam

The terms gb_memory, processors, and jobestimate are required for any task definition; they are, of course, dependent on the memory and computation requirements of the step. They are used to help manage the job queue for queueing systems such as PBS or LSF, and also to control the number of jobs simultaneously submitted on the local host by the LOCAL queue manager. The namepattern defines how to find the inputs for the sample in question.
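
For orientation, here is roughly how such values map onto the resource requests a batch scheduler understands. The PBS directives below illustrate the idea only; they are not a transcript of what SyQADA actually writes into its job scripts:

gb_memory = 4         ->  #PBS -l mem=4gb
processors = 1        ->  #PBS -l nodes=1:ppn=1
jobestimate = 60:00   ->  #PBS -l walltime=60:00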

Numerous other attributes exist. A detailed description of all options is found in Constructing a new task specification. In addition, an individual task definition can define its own attributes based on the needs of the template it requires.
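
For example, a template that filters variant calls against a set of capture regions needs to know where that region file is, so its task definition could carry an attribute of its own for it. The task name, attribute name, paths, and command below are hypothetical, a sketch of the idea rather than a real workflow step:

TASKDEF = filter-by-targets INLINE
template = INLINE bedtools intersect -a {filename} -b {capture_regions} > {output_prefix}.vcf
capture_regions = control/exome-targets.bed
gb_memory = 4
processors = 1
jobestimate = 30:00
namepattern = .vcf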

Controlling job generation

Job generation is controlled by several terms in the task description: jobgeneration, tumor_normal, and namepattern.

When jobgeneration = bashglob, which is the default, SyQADA builds a bash glob pattern to identify filenames for input to a job. The “bashglob” is formed by concatenating the input directory (inputdir) for the step with each sample name in turn and then adding the namepattern (see Generated Terms (inputdir) for the explanation of how inputdir is defined for each step). Given inputdir=input, sample=Sample1, and namepattern=.txt, this looks like:

input/Sample1.txt

The resulting pattern is expanded as a bash glob (NOT as a python regular expression!). By default, one job is generated for each match.
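
If you want to see what a constructed pattern will match before SyQADA runs, you can expand the same glob yourself in an ordinary shell. With a wildcard namepattern such as *.txt, the constructed pattern would be input/Sample1*.txt, and checking it is just a matter of:

% ls input/Sample1*.txt
input/Sample1.txt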

There are several different useful jobgeneration possibilities:

1. simple (jobgeneration = bashglob or not specified)
2. simple with glob (jobgeneration = bashglob or not specified)
3. jobgeneration = merge
4. jobgeneration = irregular
5. jobgeneration = generate
6. jobgeneration = summary (with or without glob)
7. tumor_normal = tumor-normal-filename (jobgeneration cannot be specified)

The only combination of these jobgeneration options that makes sense at this time is irregular together with any one of the others. Certain combinations produce warning messages, but one or two that might seem to make sense are likely to behave unpredictably if they do not simply cause an unceremonious SyQADA process failure.

The following examples illustrate the basic cases. Let us start with these assumptions:

1. this is step two of the protocol,
2. `inputdir` has been defined by SyQADA as `01-the-first-step`,
3. the project `.samples` file contains two samples, `S1` and `S2`,
4. the task name (first argument to TASKDEF) is `task2`,
5. the project name (normally the prefix to the config, protocol and sample files) is `JobGen`.

Finally, let us suppose that the template is simply:

template = INLINE echo {project} {sample} {filename} > {output_prefix}.out

Examples:

Case 1, simple:

namepattern = .bam

SyQADA builds each of these strings:

01-the-first-step/output/S1.bam
01-the-first-step/output/S2.bam

to identify filenames for input to the job. The resulting pattern is expanded as a unix glob (NOT as a python regular expression!) and one job is generated for each match.

Two jobs will be generated, named task2-runner-S1.sh and task2-runner-S2.sh. If you were to grep for echo in the PENDING directory, the result would look like this:

% grep echo 02-task2/PENDING/*
02-task2/PENDING/task2-runner-S1.sh: echo JobGen S1 01-the-first-step/output/S1.bam > 02-task2/output/S1.out
02-task2/PENDING/task2-runner-S2.sh: echo JobGen S2 01-the-first-step/output/S2.bam > 02-task2/output/S2.out

Case 2, simple with glob:

namepattern = *.vcf

Now we need to imagine that there are multiple vcfs for each sample in the inputdir, perhaps named S1-N.vcf and S1-T.vcf. {sample} does not change, but {output_prefix} does. Now the grep result would look like:

% grep echo 02-task2/PENDING/*
02-task2/PENDING/task2-runner-S1-N.sh: echo JobGen S1 01-the-first-step/output/S1-N.vcf > 02-task2/output/S1-N.out
02-task2/PENDING/task2-runner-S1-T.sh: echo JobGen S1 01-the-first-step/output/S1-T.vcf > 02-task2/output/S1-T.out
02-task2/PENDING/task2-runner-S2-N.sh: echo JobGen S2 01-the-first-step/output/S2-N.vcf > 02-task2/output/S2-N.out
02-task2/PENDING/task2-runner-S2-T.sh: echo JobGen S2 01-the-first-step/output/S2-T.vcf > 02-task2/output/S2-T.out

A frequent use of Case 2 is to run bwa on multiple fastqs of the same sample, where all the fastqs are in a directory named for the sample.

namepattern = /*.fastq
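
Following the concatenation rule described above, the sample name is a directory in this arrangement, so for a hypothetical inputdir of fastq and the sample S1 the constructed glob would be:

fastq/S1/*.fastq

which might match files such as fastq/S1/lane1.fastq and fastq/S1/lane2.fastq, yielding one bwa job per fastq.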

Case 3, simple with glob, adding jobgeneration = merge (same namepattern and the same imagined inputs as Case 2):

Only one job is produced per sample. However, we have to rewrite the template, because with multiple files in the glob, SyQADA expects the {filename} term to be the only SyQADA expansion term (enclosed in {}) on its line in the template. For this case, the template is:

echo {sample} \
    {filename} \
    > {output_prefix}.out

Now to see the filenames, our grep is different. The result, which lists only two files, one for each sample, would look like:

% grep step/output 02-task2/PENDING/*
02-task2/PENDING/task2-runner-S1.sh: 01-the-first-step/output/S1-N.vcf \
02-task2/PENDING/task2-runner-S1.sh: 01-the-first-step/output/S1-T.vcf \
02-task2/PENDING/task2-runner-S2.sh: 01-the-first-step/output/S2-N.vcf \
02-task2/PENDING/task2-runner-S2.sh: 01-the-first-step/output/S2-T.vcf \
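
Putting the pieces together, the body of one of these runner scripts would look roughly like the following; it is reconstructed from the template and the grep output above rather than copied from a real run:

echo S1 \
    01-the-first-step/output/S1-N.vcf \
    01-the-first-step/output/S1-T.vcf \
    > 02-task2/output/S1.out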

Case 4, jobgeneration = irregular:

SyQADA assumes that its input data is uniform, and checks that the same number of jobs is generated for each sample. If you are using Case 2 (simple with glob) and your input data (usually the sourcedata) has differing numbers of input files for each sample, SyQADA will stop with this message (with N and M appropriately substituted):

Expected to generate (N) jobs for task 02-task2, but found (M)
(use syqada batch ... --irregular or specify jobgeneration = irregular to override)

Adding jobgeneration = irregular to the METADATA for the step, or running:

syqada batch 02-task2 --step init run --irregular

will cause SyQADA to ignore the issue.

Case 5, jobgeneration = generate:

This specification is paired in the TASKDEF as follows:

jobgeneration = generate
namepattern = ignore

The result is that one job is generated per sample, with the terms {sample} and {output_prefix} defined as before. {filename} is undefined, because standard filetype extensions can be added to {sample} in the template. This offers an alternative to Case 1 and Case 3.

With our initial assumptions above, namepattern = ignore will provoke a SyQADA error saying that {filename} was not found (a bug, I know; it ought to tell you what’s wrong, and we’ll work on that). If we instead use a filetype extension in the template to specify the desired file, with something like this, there will be no complaint:

bedtools sort {sample}.vcf > {output_prefix}-sorted.vcf
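
With the sample S1 and the naming conventions of our working example, that template would expand to a command along these lines (reconstructed by analogy with Case 1, not taken from a real run):

bedtools sort S1.vcf > 02-task2/output/S1-sorted.vcf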

Case 6, jobgeneration = summary:

This usage creates a single job named for the project. If namepattern is defined, a glob is formed by concatenating the inputdir and the namepattern, and the list of resulting files is available in the {filename} attribute. jobgeneration = summary implies jobgeneration = merge, so if there are wildcards in the glob, the {filename} term must be the only {} term on its line in the template, as in Case 3.
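
As a minimal sketch, assuming we want a single project-level file that concatenates every .out file produced by the previous step (the namepattern and the combined filename here are hypothetical), the relevant TASKDEF lines would be:

jobgeneration = summary
namepattern = *.out

together with a template in which, as in Case 3, {filename} stands alone on its line:

cat \
    {filename} \
    > {output_prefix}-combined.txt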

Case 7, tumor_normal = tumor-normal-filename:

For computing on paired tumor-normal samples of any kind, the metadata should include the term tumor_normal, whose value names a pairing file. There is no convention or default for where that file lives, but my personal habit is to place it in the control directory and name it for the project. Thus, for our working example:

tumor_normal = control/JobGen.tumor_normal

SyQADA expects the file to contain three tab-separated columns, in order: “experiment” name, normal-sample name, and tumor-sample name. I use the term experiment rather than case or individual because we have been known to build tumor_normal files containing pairs of multiple putative normal samples crossed with multiple putative somatic variant samples.
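
For our working example, the file control/JobGen.tumor_normal might therefore contain lines like the following, where the columns are tab-separated and the experiment names and -N/-T sample names are invented purely for illustration:

EXP1    S1-N    S1-T
EXP2    S2-N    S2-T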