SyQADA - A system for automating bioinformatic workflows

License:

SyQADA is made available under the GNU GPL version 3 license
(http://www.gnu.org/copyleft/gpl.html), a text copy of which is included in
this bundle.

SyQADA provides a system for creating, running, and monitoring the progress of each step of a workflow, based on a project configuration that includes a sample file listing sample names, a tool-suite configuration file, and a protocol file that identifies the tasks to be performed and designates a script template for each step. The Motivation for SyQADA section discusses the reasons for SyQADA’s existence.

Here’s where convention asks for a Quick Start Guide. Abandon All Hope Ye Who Enter Here: I regret to say that the SyQADA Quick Start Guide is pathetic. (For those who have already abandoned hope, there is the Scheet Cheat Sheet.) I don’t see how to explain how to manage a bioinformatics workflow in three easy steps. Before going further, I suggest you at least read the Caveats.

SyQADA’s only dependencies are Python 2.7, bash, and the Unix operating system (in addition to the kernel, SyQADA invokes a small number of standard Unix commands). SyQADA relies on the Unix file system to record its progress and allow users to understand that progress easily. SyQADA is designed to simplify, to the extent possible, construction of the scripts necessary to run an analysis project on a set of data representing some set of samples. It cannot eliminate the kinds of problems one faces in large-scale computation, but I hope you find that it simplifies dealing with them.

SyQADA strives to simplify organization of analysis projects and create an environment in which it is easy to reproduce a workflow. SyQADA creates a standard file structure for each step and names error and output files appropriately for each sample. It creates scripts that can be run manually or using SyQADA, either on a local Unix machine (including Mac OS X) or on the MDACC clusters (the Nautilus cluster runs PBS, and the Shark cluster runs LSF; references to clusters in this document usually identify them by their cluster management software, i.e., PBS or LSF). The cluster interfaces have certain queue size settings specific to MDACC, but they can surely be modified for different clusters with little work. Generalizing them is currently feature ticket #344 in our local tracker.

Several existing workflows are included that perform divers sequence and variant analysis tasks, including GATK-based sequence alignment and recalibration; GATK variant calling; calling of somatic variants (mutect, indelocator); vtools annotation of variants using a variety of genomic resources; birdseed and haploh; GATK and haplohseq; download of TCGA data; etc.

Creation of new workflows is fairly straightforward. Script templates are relatively easy to construct (using a plain text editor such as emacs or vi) from a working example invocation of a computation. Some help is provided in this manual.
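
To make the idea of turning a working invocation into a template concrete, here is a minimal sketch in Python. It does not use SyQADA's own template syntax, which is described elsewhere in this manual. It uses the standard library's string.Template instead, and the placeholder names (sample, indir, outdir), the directory names, and the samtools command line are all assumptions chosen purely for illustration.

```python
from string import Template

# A working command line, generalized by replacing the sample-specific
# parts with placeholders. This is NOT SyQADA template syntax; the
# placeholder names and the command itself are hypothetical.
TEMPLATE = Template(
    "samtools sort -o ${outdir}/${sample}.sorted.bam ${indir}/${sample}.bam"
)


def render_scripts(samples, indir, outdir):
    """Return a dict mapping each sample name to its rendered command line."""
    return {
        s: TEMPLATE.substitute(sample=s, indir=indir, outdir=outdir)
        for s in samples
    }


# Hypothetical sample names standing in for the contents of a sample file.
scripts = render_scripts(["sampleA", "sampleB"], "input", "output")
for name in sorted(scripts):
    print(scripts[name])
```

The point is only that one tested invocation plus a list of sample names is enough to produce one command per sample. SyQADA adds the bookkeeping around that: the per-step file structure, per-sample output and error files, and job submission.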
