System for Quality-Assured Data Analysis

Our System for Quality-Assured Data Analysis (SyQADA) is a workflow management system that seeks to improve reproducibility in as simple a framework as feasible. Numerous researchers have used SyQADA workflows for dozens of projects.

SyQADA allows the simple specification of protocols that marshal and manage input data, so that analyses can be run repeatably over large volumes of data, for many samples and for different projects. SyQADA requires Unix and Python 3; beyond those, it needs only a text editor and the analysis programs used in the workflow. SyQADA exposes no programming interface. Configuration files are simple property lists; the protocol of a workflow is expressed as a list of task definitions and attributes.

To date, SyQADA has supported analysis of the following data types: whole-exome and 409-gene panel DNA sequence, RNA, 16S ribosomal DNA, whole-genome microbiome sequence, and SNP6 arrays, using Affymetrix, Illumina, and Ion Torrent technologies. Analyses have included Affymetrix Birdsuite, the GATK Best Practices variant-calling workflow, hapLOH, hapLOHseq, Ion Reporter, JLOH, MaCH, MuTect, pairwise phasing, the Tuxedo suite, and VarScan2, as well as numerous small programs for pre- and post-processing data.
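
To give a feel for what a "simple property list" means in practice, here is a minimal Python sketch that reads key = value lines into a dictionary. It is an illustration only and is not SyQADA code: the function name parse_properties, the '#' comment convention, and every key and value in the example are hypothetical. Consult the manual for the actual configuration syntax.

```python
def parse_properties(text):
    """Parse 'key = value' lines into a dict.

    Blank lines and lines starting with '#' are skipped. This layout is an
    assumption made for illustration, not SyQADA's documented format.
    """
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"expected 'key = value', got: {raw_line!r}")
        properties[key.strip()] = value.strip()
    return properties


# Hypothetical configuration text; these keys are illustrative only.
EXAMPLE_CONFIG = """\
# project settings (hypothetical keys)
project = demo_project
sample_file = samples.txt
reference = /path/to/reference.fa
"""

config = parse_properties(EXAMPLE_CONFIG)
print(config["project"])  # prints demo_project
```

The point of such a format is that an analyst can edit it with nothing more than a text editor, which matches the stated requirements above.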

As our lab prepared for the 2017 American Association for Cancer Research (AACR) conference, SyQADA provided upstream analysis support for 5 AACR posters among 8 analysts, executing well over 100 pipeline instances comprising about 350 tasks and processing nearly 89,000 jobs that consumed over 70 CPU-months on two dedicated laboratory servers and one of the institutional computing clusters. One analyst, who had had no Unix experience before beginning her analyses, used an existing workflow for sample-wise cross-correlation of microbiome prevalence. She single-handedly managed six different datasets that consumed 15 CPU-months of computing in just over 3,000 jobs. After some brief initial training, the only user support she required was, "Yes, start yet another one, you aren't overworking our machine." Together, the five projects analyzed well over 2,700 samples from about 800 individuals, including more than 1,100 tumor-normal pairs, spanning most of the data types described above.

SyQADA is freely available under the GNU GPL version 3 license.

SyQADA was developed by Jerry Fowler and Anthony San Lucas with design help and advice from Paul Scheet.

The README for this release may be examined here: 00-README.txt

The manual for this release may be examined here: SyQADA 2.0 RC2 manual

