.. _script-templates:

How to Build a Script Template
==============================

Script templates are typically built from working invocations, with the
SyQADA- and configuration-dependent terms then extracted into standard
terms enclosed in braces. Here is a simple example. The `#!/bin/bash` on
the first line is optional; it serves as a reminder that the script will
eventually be run as a shell script, but it will be supplanted by another
#!/bin/bash during job generation::

    #!/bin/bash
    {bedtools} intersect \
        -wa \
        -header \
        -abam {filename} \
        -b {capturefile} \
        > {output_prefix}.bam

The terms `bedtools` and `capturefile` are expected to be defined in the
config file, and should be specific as to full path and version.
*Failure to specify full paths to executables and reference files can
jeopardize your ability to reproduce your workflow.* The value of
`capturefile` depends on the capture technology used for the sequencing
experiment.

The terms `filename` and `output_prefix` are standard terms produced by
the SyQADA job generator based on the project inputs.

Special Considerations
----------------------

Several terms are special to specific kinds of analysis.

The terms `groupid`, `forward`, and `reverse` are only used for the
`bwa sampe` step (task 02) of the Illumina alignment workflow. They are
used to resolve the paired read files generated by Illumina sequencing.

The terms `tumorname` and `normalname` are used for tumor-normal
comparisons, such as somatic variation detection using Mutect. In this
case only, the term `sample` is filled from the name given to the
tumor-normal pair in the first column of the tumor-normal file, which has
the format::

    individual normal_sample tumor_sample

on each line (names are separated by tabs).

INLINE templates
----------------

[Effective syqada 2.0-delta] A simple one-line command can be expressed as an INLINE template.
For example::

    template = INLINE wc -c {inputdir}/{sample}.name > {output_prefix}.chars

This example comes from the basic tutorial protocol named
Example.reference.

Error reporting
---------------

As an aid to understanding configuration errors, SyQADA parses the
template to look for terms found within braces that are not defined in
either the config file or the protocol file (or a task file referenced by
the protocol file). If any are found, an error similar to this is
reported::

    Keys not found in the configured task: ['missing_term1', 'missing_term2']

.. _terms:

Generated Terms
---------------

These are the terms that are defined by SyQADA during job generation.
They can be used as appropriate for a given script.

check_error
    This is used for actions that have multiple command invocations
    within them. It is replaced by standard shell conditional code that
    writes a timestamp to the `.failed` file if the previous step
    returned a non-zero error code, and proceeds if it returned OK.

chromosome
    Substituted with the generated chromosome during splitting of a job
    for locus-based parallelism. See `region`, which is preferred in most
    cases.

filename
    The generated input filename for each created invocation script. If
    the invocation script uses the --merge option (or the --summary
    option, which implicitly does a --merge), then the {filename} term
    must stand alone with no other braced terms on the same line.

files_per_job
    The number of output files that each job should generate. This is
    normally unnecessary, because SyQADA checks that all successful jobs
    produce the same number of outputs, but it can be used to validate
    the number of output files for summary jobs, because a single job has
    nothing to compare to. It could also be used to allow
    `files_per_job = 0` for those occasional cases where the output name
    does not match the sample name, but I would think that a QC step
    would be a better idea.

forward
    Forward readname for forward-reverse read matching.
groupid
    Group name for forward-reverse read matching.

.. _inputdir:

inputdir
    Location of the directory or directories in which to find the input
    for this step. The *inputdir* for any task is the output directory of
    the preceding step. If *inputdir* is specified in the task
    description or protocol, the default is overridden with the new
    value, which may be either an existing directory or a task
    identifier. A step whose task identifier begins with 'QC' is ignored
    when determining which is the preceding step. (A step whose task
    identifier begins with 'QC' will use the preceding step's output as
    input whether its identifier begins with 'QC' or not.)

    The job generator then uses the sample name and the namepattern to
    build an input or a set of inputs using shell globbing syntax.

    The special specification::

        inputdir = sourcedata

    can be used to repeat the use of the sourcedata directory for input
    during a later step.

    Additional input directories may be specified in the task description
    or protocol using *added_input*, which can be a comma-separated list
    of existing directories or task identifiers. They will be referenced
    by index as {added_input=1}, {added_input=2}, etc. {added_input} is
    the equivalent of {added_input=1}. *added_input* does not contribute
    to the definition of *filename*, so templates must use constructions
    like {added_input=1}/{sample}.suffix to make reference to specific
    files. The term *sourcedata* may be used to specify the sourcedata
    directory.

iteration
    If *{iteration}* is specified in the template, a number of jobs
    corresponding to the value of the term *iterations* will be
    generated, populating the value of *iteration* in each job with the
    corresponding iteration number. To trigger this behavior, the term
    *iterations* must appear in the task definition (or in the parameters
    for that TASKDEF in the protocol).
    *iterations* can either be a number or a comma-separated set of
    numbers, and behaves a bit like the inputs to the python range()
    function, e.g.: from,to,step, except that . If the value of
    *iterations* is instead a string (e.g., override), then the value is
    taken from the value of *iterations* in the config file. The claim is
    that this simplifies the construction of workflows like the SparCC
    runner, which uses the value of *iterations* to generate that number
    of permutations, and then later uses *iteration* to iterate over
    those permutations. As to simplicity, your opinion may vary, but it
    did avoid duplicating the parameter from the config file into the
    protocol/task file.

logdir
    Location of the LOGS directory, used occasionally when a program
    creates its own log output other than stdout/stderr.

mergefile
    Obsolete term formerly used to indicate filling multiple filenames as
    input, replaced by *filename*.

normalname
    The normal input for a tumor/normal pair defined by a --tumor_normal
    option. The individual from whom the pair was taken is available as
    *sample*. This term replaces the obsolete term *normalfile*.

output_prefix
    Shorthand for {outputdir}/{root}{region}

outputdir
    Location of the directory in which to write all the output for this
    step; the path is created by the job generator.

project
    The project name provided as METADATA, most often used during summary
    tasks.

region
    Substituted with the generated locus range during splitting of a job
    for locus-based parallelism.

reverse
    Reverse readname for forward-reverse read matching.

root
    The rootname (with trailing dotted suffix removed, equivalent to the
    shell syntax $blah:r) of an output file, to which an appropriate
    suffix is usually then added.

sample
    The sample name taken from the sample_file, normally PROJECT.samples.
    In the case of the tumor_normal option, the individual from whom the
    pair was taken is available as *sample*.
sample_file
    The file, normally PROJECT.samples, containing information about the
    samples in the project, one sample per line. See :ref:`sample_file`
    for details.

tmpdir
    Location of the standard temporary directory for the project, often
    used as a value for java.io.tmpdir, but available for any similar
    purpose. An attempt is made to clean up its contents at the end of a
    successful batch.

touch_output
    This is used for actions that produce no output, such as the import
    and annotation steps of vtools. It is replaced by a command that will
    create a {root}.complete file in the {outputdir}, so that SyQADA can
    conclude that the step completed successfully. (See :ref:`hacks`)

tumorname
    The tumor input for a tumor/normal pair defined by a --tumor_normal
    option. The individual from whom the pair was taken is available as
    *sample*. This term replaces the obsolete term *tumorfile*.
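
The check described under Error reporting, and the filling of braced terms
in general, can be pictured with a small Python sketch. This is not
SyQADA's actual code: the helper names (``template_terms``,
``missing_terms``, ``fill_template``), the regular expression, and the
example paths are all assumptions made for illustration. It only shows
the general idea of collecting braced terms, comparing them with what the
config file and the job generator define, and reporting the ones that are
missing.

```python
import re

# Hypothetical sketch, not SyQADA's implementation. A "term" here is any
# run of non-space, non-brace characters enclosed in a single pair of braces.
TERM_PATTERN = re.compile(r"\{([^{}\s]+)\}")


def template_terms(template):
    """Return the braced terms of a template in order of first appearance."""
    seen = []
    for term in TERM_PATTERN.findall(template):
        if term not in seen:
            seen.append(term)
    return seen


def missing_terms(template, defined):
    """Return the terms used in the template that have no definition."""
    return [term for term in template_terms(template) if term not in defined]


def fill_template(template, defined):
    """Substitute every braced term, or fail listing the undefined ones."""
    missing = missing_terms(template, defined)
    if missing:
        raise ValueError(
            "Keys not found in the configured task: %r" % (missing,))
    return TERM_PATTERN.sub(lambda m: str(defined[m.group(1)]), template)


# Terms from the config file (the path is a made-up example) and terms
# that would be produced by the job generator.
config = {"bedtools": "/opt/bedtools-2.25.0/bin/bedtools"}
generated = {"filename": "input/sample1.bam",
             "output_prefix": "output/sample1"}
defined = dict(config, **generated)

template = ("{bedtools} intersect -wa -header "
            "-abam {filename} -b {capturefile} > {output_prefix}.bam")

# capturefile was left out of the config above, so it is reported.
print(missing_terms(template, defined))
```

With ``capturefile`` added to ``config``, ``fill_template(template,
defined)`` would return the completed command line. A real implementation
would also need to handle indexed terms such as {added_input=1} and the
multi-file expansion of {filename}, which this sketch does not attempt.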