SyQADA - A system for automating bioinformatic workflows

License:

SyQADA is made available under the GNU GPL version 3 license
(http://www.gnu.org/copyleft/gpl.html), a text copy of which is included in
this bundle.

SyQADA provides a system for creating, running, and monitoring the progress of each step of a workflow, based on a project configuration that includes a sample file listing sample names, a tool-suite configuration file, and a protocol file that identifies the tasks to be performed and designates a script template for each step.

The reasons for SyQADA’s existence are discussed in Motivation for SyQADA.

Here’s where convention asks for a Quick Start Guide. Abandon All Hope Ye Who Enter Here: I regret to say that the SyQADA Quick Start Guide is pathetic. (For those who have already abandoned hope, there is the Scheet Cheat Sheet.) I don’t see how to explain managing a bioinformatics workflow in three easy steps. Before going further, I suggest you at least read the Caveats.

SyQADA’s only dependencies are Python (roughly 3.5 or later), bash, and the Unix operating system (in addition to the kernel, SyQADA invokes a small number of standard Unix commands). SyQADA relies on the Unix file system to record its progress and to let users understand that progress easily. SyQADA is designed to simplify, to the extent possible, construction of the scripts necessary to run an analysis project on data representing some set of samples. It cannot eliminate the kinds of problems one faces in large-scale computation, but I hope you find that it simplifies dealing with them.

SyQADA strives to simplify organization of analysis projects and create an environment in which it is easy to reproduce a workflow. SyQADA creates a standard file structure for each step and names error and output files appropriately for each sample. It creates scripts that can be run manually or using SyQADA, either on a local Unix machine (including MacOS X) or on the MDACC clusters. The Nautilus cluster runs PBS and the Shark cluster runs LSF; references to clusters in this document usually identify them by their cluster management software, i.e., PBS or LSF. The cluster interfaces expect to find local settings for queue size in the resources directory, so that an external user can, we hope, modify only a specific set of named constants and adapt SyQADA to a different queueing policy. The settings provided are specific to MDACC.
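To make the idea of a standard per-step file structure concrete, here is a minimal Python sketch. It is an illustration only, not SyQADA's actual layout: the directory names (scripts, output, logs), the file suffixes, the step name, and the function names are all assumptions made up for this example.

```python
from pathlib import Path
import tempfile


def plan_step_layout(project_dir, step_name, samples):
    """Map each sample to script, output, and error paths for one step.

    The directory names and suffixes here are hypothetical; they are
    not taken from SyQADA itself.
    """
    step_dir = Path(project_dir) / step_name
    layout = {}
    for sample in samples:
        layout[sample] = {
            "script": step_dir / "scripts" / f"{sample}.sh",
            "output": step_dir / "output" / f"{sample}.out",
            "error": step_dir / "logs" / f"{sample}.err",
        }
    return layout


def create_step_dirs(layout):
    """Create the parent directory of every planned file."""
    for paths in layout.values():
        for path in paths.values():
            path.parent.mkdir(parents=True, exist_ok=True)


if __name__ == "__main__":
    # Build the layout in a throwaway directory and list the error files.
    with tempfile.TemporaryDirectory() as project:
        layout = plan_step_layout(project, "01-align", ["sampleA", "sampleB"])
        create_step_dirs(layout)
        for sample, paths in layout.items():
            print(sample, paths["error"].relative_to(project))
```

Because every sample gets the same predictable set of paths, progress on a step can be read straight off the file system, which is the property the paragraph above describes.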

Several existing workflows are included that perform diverse sequence and variant analysis tasks, including GATK-based sequence alignment and recalibration; GATK variant calling; calling of somatic variants (mutect, indelocator); vtools annotation of variants using a variety of genomic resources; birdseed and haploh; GATK and haplohseq; and download of TCGA data.

Creation of new workflows is fairly straightforward. Script templates are relatively simple to construct (using a simple text editor such as emacs or vi) from a working example invocation of a computation. Some help is provided in this manual.
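The general idea of turning a working invocation into a per-sample script template can be sketched with Python's standard string.Template. This is not SyQADA's template syntax, which is described elsewhere in this manual; the tool name my_aligner, the placeholder names, and the function render_scripts are hypothetical and exist only for this illustration.

```python
from string import Template

# A working command line in which the sample-specific parts have been
# replaced by placeholders. Tool and placeholder names are hypothetical.
SCRIPT_TEMPLATE = Template(
    "#!/bin/bash\n"
    "set -euo pipefail\n"
    "my_aligner --input ${input_dir}/${sample}.fastq "
    "--output ${output_dir}/${sample}.bam\n"
)


def render_scripts(samples, input_dir, output_dir):
    """Return one rendered script per sample, keyed by sample name."""
    return {
        sample: SCRIPT_TEMPLATE.substitute(
            sample=sample, input_dir=input_dir, output_dir=output_dir
        )
        for sample in samples
    }


if __name__ == "__main__":
    scripts = render_scripts(["sampleA", "sampleB"], "fastq", "bam")
    print(scripts["sampleA"])
```

substitute raises KeyError when a placeholder has no value, so a misspelled name in a template fails at generation time instead of producing a broken script.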
