Motivation for SyQADA

SyQADA is a system of Python libraries and executables for managing analysis projects over large amounts of data. The goal of SyQADA is to make it simple to:

Make processes reproducible:

Running the identical process on many inputs of the same type

Providing some measure of confidence upon completion that every step
of every task was performed on every input

Re-running the same process with altered parameters

Duplicating an existing experimental method on new data

Unify process development:

Running steps of the process either on the cluster or on the HAPS
(or on a mac)

Adapting the workflow of one analysis to support a new analysis

Simplify problem-solving:

Identifying data problems

Identifying and re-running jobs that fail because of system problems

Providing QC reports and visualization as a standard with any project

Support publication of analytical results:

Documenting what computational steps were performed for a given
project

Documenting what versions of software and what parameters were used
for a given project

Ultimately, producing MAGE-TAB-like IDF and SDRF output to use in
publication of the experimental methods

Simplify data management:

Identifying the path of intermediate data through the workflow

Standardizing the wrangling of intermediate filenames so common to
process workflows

Removing easily reproducible intermediate data

Assist in time management:

Providing time estimates for individual steps and whole workflows

Supporting the execution of an entire processing workflow with as
little user intervention as possible

Philosophical Motivation

These ambitious goals are tempered by the following desires:

Simple usage:

Keeping the usage of the tool as understandable as possible because
the inherent complexity of data management and bioinformatics
analysis does not need further confounding factors.

With SyQADA, there is little distinction between a regular user and
a power user. A power user of Unix will have an easier time dealing
with problems than a novice, but a SyQADA novice may well learn
every useful SyQADA command over the course of a first SyQADA
workflow.

No unjustified generality:

SyQADA development adheres to the software engineering principle
of YAGNI: if You Ain't Gonna Need It, you should not implement
it. It is difficult enough to design and execute a good, useful,
and reliable workflow without complicating it with features and
syntax that obscure the simple; we prefer not to complicate our
lives unnecessarily.

All of the workflows that we have developed are simple linear
processes that require no decision-making based on results of
analysis. We believe that this is the typical case with research
workflows. The decision-making is an inherent part of the research,
and too dependent on "researcher intuition" to be reproducible. The
workflows need to conform to a repeatable protocol to produce
sufficient breadth of data to permit one to draw conclusions.

A common failing of workflow systems is to provide for arbitrary
complexity simply because it can be done. This leads to systems
that are unnecessarily hard to use and often confusing. One recent
demonstration I saw included a graphic display of the workflow
that completely obscured its simple linear nature, simply because
the tool used to draw the display was a general-purpose tool
designed to display more complex graphs.

Similarly, depending on XML for specifications imposes
unnecessary generality. XML is a wonderfully expressive language,
but it requires special expertise to parse, is inherently difficult
for a human to read, and would unnecessarily complicate the life of
both the user and the developer. I assure you that neither one of
us wants that.

Minimal magic:

The tool should avoid as much as possible making the user dependent
on it to accomplish the task. Using *syqada manage* makes it much
easier to determine the state of a task, but it is quite possible
to understand the task status using only the Unix `ls` command.
Similarly, if a user wishes, she can use SyQADA to generate the
jobs and then submit them manually.

The tool should also avoid making the developer dependent on
specialized knowledge other than basic programming. Thus, job
management information is stored in the file system rather than in
a database (even an embeddable one) so that neither user nor
developer need consider the use of SQL or NoSQL. SQL, NoSQL, XML,
and JSON have their places, but adding to the intellectual overhead
required to use this tool is not one of them.
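To illustrate why keeping job state in the file system needs no special tooling, here is a minimal sketch of how a task's status could be summarized with nothing but directory listings. It assumes a hypothetical layout in which each task directory holds one subdirectory per job state; the names `PENDING`, `RUNNING`, `DONE`, and `ERROR`, and the function `task_status`, are illustrative assumptions and not SyQADA's documented layout or API.

```python
from pathlib import Path

# Hypothetical job-state directory names, chosen for illustration only;
# SyQADA's actual on-disk layout may differ.
STATUS_DIRS = ("PENDING", "RUNNING", "DONE", "ERROR")


def task_status(task_dir):
    """Count the job files found in each state subdirectory of task_dir.

    A state directory that does not exist is reported as holding 0 jobs.
    """
    counts = {}
    for status in STATUS_DIRS:
        status_path = Path(task_dir) / status
        if status_path.is_dir():
            # Each regular file stands for one job in this state.
            counts[status] = sum(1 for p in status_path.iterdir() if p.is_file())
        else:
            counts[status] = 0
    return counts
```

Under these assumptions, calling `task_status` on a task directory gives the same information as running `ls` on each state subdirectory and counting the entries, which is the point: nothing beyond the file system is required to see where a task stands.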

Apologia

Although, or perhaps because, I have spent a substantial fraction of my career developing dynamic web-based systems, I have not seen a need for a point-and-click interface, because I believe that it is too easy to encounter a situation that requires direct manipulation of the file system. I think that a web interface providing the kind of system access needed to deal with the problems that might occur in an embarrassingly parallel clustered computation workflow would be so general that you would have to provide all the functionality of a Unix shell to make it usable. Better to avoid re-inventing the wheel by simply using a secure shell interface.

I believe furthermore that such a web interface would be so flexible and option-ridden that it would be on the one hand almost unusable, and on the other inherently impossible to secure against injection attacks like that of the incomparable Little Bobby Tables (http://xkcd.com/327):

[Image: xkcd comic "Exploits of a Mom" (_images/exploits_of_a_mom.png)]