.. _glossary:

Glossary
========

batch
    All the jobs that need to be performed for a particular task or step of a
    protocol.

batch runner
    BatchRunner.py is the program invoked by SyQADA to manage job generation
    and batch execution.

batchroot
    The directory in which the jobs for a task will be managed. This directory contains
    a METADATA file as well as a series of directories, PENDING, RUNNING, DONE, ERROR, and LOGS,
    which record the state of the batch. Additional directories QUEUED and STUCK are created
    if those conditions are detected by SyQADA. SyQADA commands batch, manage, and tools all
    require the specification of batchroot as their first argument in order to work correctly.
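
    Because the state of a batch is recorded in directories, a rough status summary
    can be read straight from the file system. The following is a minimal sketch, not
    SyQADA's own code: it assumes that each job appears as one entry in the directory
    for its current state, and the function name ``summarize_batch`` and the example
    file names are hypothetical.

    .. code-block:: python

        import tempfile
        from pathlib import Path

        # State directories named in this entry; QUEUED and STUCK may be absent.
        STATES = ["PENDING", "QUEUED", "RUNNING", "STUCK", "DONE", "ERROR"]


        def summarize_batch(batchroot):
            """Count the entries in each state directory that exists under batchroot."""
            root = Path(batchroot)
            counts = {}
            for state in STATES:
                state_dir = root / state
                if state_dir.is_dir():
                    # Assumption: one directory entry per job in that state.
                    counts[state] = sum(1 for _ in state_dir.iterdir())
            return counts


        # Build a throwaway batchroot (hypothetical contents) to exercise the function.
        with tempfile.TemporaryDirectory() as tmp:
            root = Path(tmp)
            for state in ["PENDING", "RUNNING", "DONE", "ERROR", "LOGS"]:
                (root / state).mkdir()
            (root / "DONE" / "sampleA.sh").touch()
            (root / "DONE" / "sampleB.sh").touch()
            (root / "ERROR" / "sampleC.sh").touch()
            summary = summarize_batch(root)

        print(summary)

    Directories that do not exist, such as QUEUED and STUCK in this example, are simply
    left out of the summary.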

big-O notation
    Optional representation of computational complexity used in specification of `jobestimate` and `gb_memory`,
    if you wish to try to improve the accuracy of walltime estimates. See :ref:`tuning_estimates`.

config file
    A file that contains key-value pairs that will be used to populate
    values in the script templates during job generation. Most of the
    values are paths to software or reference files. This file should
    remain unaltered during the lifetime of the pipeline (presumably
    one might add to it, but one should not change an existing
    setting, such as the version of a piece of software, after using
    that software in the pipeline). An example config file is found in
    workflows/control/Example.config.

environment
    The standard Unix dictionary of variable names and their mapped
    values that is provided to a Unix process when it starts. In the
    bash shell, the environment can be displayed by running the
    ``env`` command. Environment variables are set in the bash shell
    and passed to child processes by executing the command ``export
    VARNAME=value``. They are used by prefixing the variable name with
    a dollar sign. Thus, given the command above, the command ``echo
    $VARNAME`` would print ``value``. A few environment
    variables are used in the default config, which
    is very fond of TEAM_ROOT, because that is relatively well fixed
    for the Scheet Lab on a given machine. One should avoid environment variables
    in general, though, because they usually become so well ingrained
    in the user's sense of environment that the problems they cause are devilishly hard to debug.
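
    The inheritance described above can be observed from Python as well as from bash.
    This is an illustrative sketch only: ``VARNAME`` is the placeholder used in this
    entry, and ``require_env`` is a hypothetical helper, not part of SyQADA, showing one
    way to fail early with a clear message when an expected variable is missing.

    .. code-block:: python

        import os
        import subprocess
        import sys

        # Equivalent of `export VARNAME=value`: set the variable in this process.
        os.environ["VARNAME"] = "value"

        # A child process receives a copy of the parent's environment when it starts.
        child = subprocess.run(
            [sys.executable, "-c", "import os; print(os.environ['VARNAME'])"],
            capture_output=True,
            text=True,
            check=True,
        )
        child_value = child.stdout.strip()
        print(child_value)


        def require_env(name):
            """Return an environment variable's value, or fail with a clear message."""
            try:
                return os.environ[name]
            except KeyError:
                raise RuntimeError(f"environment variable {name} is not set") from None

    Checking for a variable explicitly at startup turns a silent misconfiguration into
    an immediate, readable error.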

gather
    An attribute of a step in a replicate workflow that specifies which replicates
    from previous steps will be summarized.

interface
    A subclass of queue manager used for a particular task execution.
    See *queue_manager*, below.

job
    A single instance of process execution suitable for submission to the queue_manager.
    For most workflows, there will be one job
    generated for each sample for each task. For CPU-intensive tasks, the *split* option
    can generate data parallelism by chromosome or region. For summary tasks,
    the *jobgeneration = summary*
    option can generate a single job that collates all results of the previous step for reporting
    or analysis. For tumor-normal tasks, there is one job per tumor-normal pair.

job generation
    The act of creating the necessary execution scripts for a step of
    the pipeline, performed by JobGenerator.py (but usually invoked by
    BatchRunner.py behind the scenes). Job generation involves reading
    the config file, a METADATA file, a script template, and a sample file, and producing
    executable scripts that can be run on the local CPU or submitted to
    the cluster.

jobgeneration
    A configuration term found in the protocol file or task file describing how
    SyQADA should generate jobs. Possible values are `generate`, `merge`, and `summary`.
    The special value `irregular` is used to indicate that the number of jobs generated
    can vary from sample to sample.

parameter, template parameter
    A term wrapped in braces in a template (e.g., {parameter1}). The parameter and braces
    will be substituted with a value defined in either the config file, the task file, or the
    protocol file for the step. Such a definition can itself embed another term in braces; however,
    circular references are not permitted and reference chains are discouraged.
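
    The substitution rule can be sketched in a few lines. This is an assumption-laden
    illustration rather than the JobGenerator's actual implementation: the function
    ``resolve``, the parameter names, and the paths are all hypothetical, and the sketch
    does not try to protect shell constructs such as ``${var}`` in a real template.

    .. code-block:: python

        import re

        # A template parameter is a word wrapped in braces, e.g. {parameter1}.
        PARAM = re.compile(r"\{(\w+)\}")


        def resolve(text, definitions, _active=()):
            """Replace each {name} in text, following references inside definitions."""

            def substitute(match):
                name = match.group(1)
                if name in _active:
                    chain = " -> ".join(_active + (name,))
                    raise ValueError(f"circular reference: {chain}")
                if name not in definitions:
                    raise KeyError(f"undefined template parameter: {name}")
                # A definition may itself embed another term in braces.
                return resolve(definitions[name], definitions, _active + (name,))

            return PARAM.sub(substitute, text)


        # Hypothetical definitions, as if merged from config, task, and protocol files.
        definitions = {
            "team_root": "/data/team",
            "reference": "{team_root}/reference/genome.fa",
            "sample": "S1",
        }
        line = resolve("aligner --ref {reference} --sample {sample}", definitions)
        print(line)

    Tracking the names currently being expanded is what lets the sketch reject a
    circular reference instead of recursing forever.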

.. _pipeline:

pipeline
    An automated sequence of processes requiring no human
    intervention.

    A brief manifesto on word usage: it seems that everyone in the
    bioinformatics world refers to a sequence of bioinformatic steps
    to analyze a set of data as a "pipeline."
    Because of my computer science upbringing, my definition of a
    "pipeline" is an automated process, whereas a sequence of steps in which a
    user executes each step and studies the results before proceeding
    to the next step is a "workflow."  However, to communicate with
    everyone else in bioinformatics, I often equate them. You may see either term used
    in this documentation.

protocol
    A list of tasks in the order they should be performed, along with
    specifications of the variable parameters. This can either be a
    file containing the tasks and all their specifications, or a file
    that lists the taskfiles (.task) themselves, as well as possible parameter
    choices on a per-step basis. A workflow is governed
    by a protocol; through metonymy, protocol and workflow often stand for
    each other in this document.

QC step
    A quality control step, that is,
    a step in a workflow that measures some aspect of the performance of a previous
    step, whose output is therefore not directly part of the workflow's production.
    The task identifier for such a step should be prefaced with `QC` (as in
    `QC1-coverage`) so that SyQADA will set the subsequent task's *inputdir*
    appropriately.

.. _queue_manager:

queue_manager
    The controller that knows how to submit jobs for execution, check their
    statuses, and manage their output. There are currently three queue_managers:
    PBS, which runs jobs on the Nautilus cluster; LSF, which runs jobs on the Shark cluster; and LOCAL, which runs jobs on the local host.
    Since version 0.9.8, SyQADA identifies and chooses the cluster manager when running on a cluster node, and the
    LOCAL manager when not. Specifying "interface = LOCAL" in the METADATA forces use of the LOCAL manager
    even when on the cluster.

replicate
    A single iteration of the pipeline, generated by placing one value
    for each of one or more designated interpolant terms into the
    METADATA file of that iteration. There is no limit on the number of
    replicate parameters or their values, but if the product of the numbers of values
    is high, it is likely to tax the capacity of
    the system and the patience of both the invoker and the cluster system
    administrators.
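
    To see how quickly the number of replicates grows, it can help to enumerate the
    combinations before launching anything. The sketch below only illustrates the
    arithmetic; the parameter names and values are hypothetical, and it makes no claim
    about how SyQADA itself builds replicates.

    .. code-block:: python

        import itertools

        # Hypothetical replicate parameters and their candidate values.
        replicate_values = {
            "min_depth": [10, 20, 30],
            "caller": ["callerA", "callerB"],
        }

        names = list(replicate_values)
        # One replicate per combination: one value chosen for each parameter.
        replicates = [
            dict(zip(names, combo))
            for combo in itertools.product(*(replicate_values[n] for n in names))
        ]

        print(len(replicates))  # 3 values x 2 values = 6 replicates
        print(replicates[0])

    Adding a third parameter with four values would raise the count to 24, which is why
    the product deserves a look before submission.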

sample
    The name of some biological tissue or other bodily substance
    that has been extracted and measured to yield a file containing data
    that is to be analyzed in a workflow. Although this is often named for
    an individual, care should be used to distinguish the usage, because
    one individual can provide many samples. The obvious example is a
    tumor-normal pair of samples from one individual.

sample file
    A file containing a series of sample names, one to a line, that
    will be used to find data on which to perform a workflow. The term
    sample_file is available for use in script templates. Starting with 1.1,
    the sample file may be a tab-delimited file containing other columns
    that provide phenotypic or other sample-specific attributes.
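
    A reader for both forms of the sample file might look like the sketch below. It
    assumes, without confirmation from SyQADA's source, that the sample name is the
    first column and that the file has no header line; ``read_samples`` and the sample
    names are hypothetical.

    .. code-block:: python

        import csv
        import io


        def read_samples(handle):
            """Return (names, attributes) read from a one-per-line or tab-delimited file."""
            names = []
            attributes = {}
            for row in csv.reader(handle, delimiter="\t"):
                if not row or not row[0].strip():
                    continue  # skip blank lines
                name = row[0].strip()
                names.append(name)
                # Any further columns are kept as sample-specific attributes.
                attributes[name] = row[1:]
            return names, attributes


        # Hypothetical file contents: two samples with an attribute, one without.
        example = io.StringIO("S1_tumor\tcase\nS1_normal\tcontrol\nS2\n")
        names, attributes = read_samples(example)
        print(names)
        print(attributes)

    A plain list of names is handled by the same code, since a line with no tabs is
    simply a row with one column.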

scatter
    A replication step that defines which parameters will be "scattered"
    across the task to create individual replicates.

script template
    A file that represents a Unix bash shell script with certain
    standard terms surrounded in braces (viz., ``{braces}``)
    indicating that they will be replaced by values found in the
    configuration file or computed by the JobGenerator during SyQADA
    initialization.

stderr, stdout
    The historical abbreviations of the names of the two standard outputs of a Unix process.
    The stderr and stdout for each job are captured in LOGS/SAMPLENAME.err and LOGS/SAMPLENAME.out.

task
    One step of a workflow, defined by a .task file that specifies
    a script template (usually in the workflows
    directory) with a name in the form task_name.template.
    .task files usually carry numeric prefixes to help provide
    a guide to their ordering, but a protocol file is a better guide
    to the proper ordering of the tasks in a workflow. For historical
    reasons, in SyQADA this is also called a batch.

task identifier
    The term assigned in the TASKDEF of a protocol step to identify its output directory for
    use as input in non-immediate subsequent steps. The format is::

        TASKDEF my_index_name path-to-task-definition.task

    which permits later reference using either::

        inputdir = my_index_name

    or::

        added_input = my_index_name

workflow
    A sequence of processes performed by a system. Many workflows are
    computerized, so that some steps are performed by machines and
    some steps, usually quality control, are performed by humans. A
    *pipeline* is a *workflow* performed entirely by machine. As of
    0.9.8, SyQADA will run workflows as pipelines up to the point of first error,
    at which point they morph back into workflows.

workflows directory
    The directory under the SYQADA home that contains several nested workflow subdirectories,
    each of which contains the script templates for the tasks in that workflow. See :ref:`Workflows`.

working directory
    The directory in which SyQADA runs and creates its
    task subdirectories. A typical example is ``$PROJECT/working/alignment``.

YAGNI
    Ya Ain't Gonna Need It: The "extreme programming" design
    philosophy that governs SyQADA development. Only those features
    that are identified as necessary should be designed and
    implemented. For example, SyQADA makes no provision for workflows with
    conditional paths, because the cases where that is appropriate
    don't seem to occur in our workflows, it would be difficult to
    implement, and it would make workflow specification more difficult
    than it already is.