Replication Tutorial

If you thought A Simple, Stupid Workflow was simple and stupid, you have another horror waiting for you here.

Running syqada tutorial and selecting the REPLICATION tutorial will populate your tutorial directory with a control directory and the protocol, config, and samples files.

Start by examining control/REPLICATION_TEST.protocol. You’ll see that it is version 1.3, which is the minimum required version number for using replication (versioning of protocols has persisted since the transition from SyQADA-0 to SyQADA-1.0, although empirical evidence suggests that it’s a Bridge Too Far). The next three lines define the parameters to be replicated. These lines must appear in the protocol before the first TASKDEF.

The protocol describes three tasks. The template of the first of them is:

echo PA{parameterA}, PD{parameterD} > {output_prefix}.out

which, of course, just creates an output file for each sample listing the parameters for that replication. The third one simply greps a particular value from the first output set. Note the echo command, which causes the script to succeed whether grep finds anything or not:

grep PB4 {inputdir}/{sample}.out > {output_prefix}.out
echo ignore the error code
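
The reason the trailing echo matters is that a shell script's exit status is that of its last command, and grep exits with status 1 when it finds no match. Here is a minimal sketch of that behavior in Python, assuming a POSIX sh and grep are on your path. The helper name exit_status is hypothetical, and /dev/null stands in for an input file that contains no PB4 line; none of this is syqada code.

```python
import subprocess

def exit_status(script):
    """Run a shell script body with sh and return its exit status."""
    result = subprocess.run(
        ["sh", "-c", script],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode

# grep on an empty input selects no lines, so it exits with status 1.
grep_only = "grep PB4 /dev/null"

# The same grep followed by echo, as in the template above: echo is the
# last command and succeeds, so the script as a whole exits with status 0.
grep_then_echo = "grep PB4 /dev/null\necho ignore the error code"

print(exit_status(grep_only))       # 1
print(exit_status(grep_then_echo))  # 0
```

The cost of the trick is that a genuine grep failure (an unreadable input file, say) is masked too, which is tolerable in a tutorial and worth remembering elsewhere.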

The config file contains a dummy source directory, because this protocol does not require data. The samples file identifies two samples.

When you run:

syqada auto --protocol REPLICATION_TEST --init

you should see a series of batch directories, one for each replicate of each step.

Replication Directory Structure

For the example replication workflow, with two parameters varying over two values each in the first and third steps, one parameter varying over two values in the second (partial aggregation) step, and one final aggregation step, these batch directories will be created:

01-simple-replication-pa1_1-pa4_7
01-simple-replication-pa1_1-pa4_8
01-simple-replication-pa1_2-pa4_7
01-simple-replication-pa1_2-pa4_8
02-partial-aggregation-pa1_1
02-partial-aggregation-pa1_2
03-another-replication-pa1_1-pa2_3
03-another-replication-pa1_1-pa2_4
03-another-replication-pa1_2-pa2_3
03-another-replication-pa1_2-pa2_4
03-QC-step1-pa1_1-pa2_3
03-QC-step1-pa1_1-pa2_4
03-QC-step1-pa1_2-pa2_3
03-QC-step1-pa1_2-pa2_4
04-aggregation
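
The listing above is consistent with taking the Cartesian product of the value sets of whichever parameters are still varying at each step. Here is a minimal sketch of that naming scheme in Python. The function name batch_dirs is hypothetical, the abbreviations and values are simply read off the listing, and this is an illustration of the pattern, not syqada's own code.

```python
from itertools import product

def batch_dirs(step, name, value_sets):
    """Return one directory name per combination of parameter values.

    value_sets maps an abbreviated parameter name (e.g. "pa1") to the
    list of values it takes at this step; an empty mapping means the
    step is a full aggregation with a single directory.
    """
    dirs = []
    for combo in product(*value_sets.values()):
        # Each parameter contributes a "-abbrev_value" piece to the name.
        suffix = "".join(
            f"-{abbrev}_{value}" for abbrev, value in zip(value_sets, combo)
        )
        dirs.append(f"{step:02d}-{name}{suffix}")
    return dirs

for d in batch_dirs(1, "simple-replication", {"pa1": [1, 2], "pa4": [7, 8]}):
    print(d)
print(batch_dirs(2, "partial-aggregation", {"pa1": [1, 2]}))
print(batch_dirs(4, "aggregation", {}))
```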

The name of a replication directory may be parsed to identify the values of parameters used in each replication. The parameter names are abbreviated to three characters, made unique by numbering the third character if necessary; in this example, parameterA appears as pa1 and parameterD as pa4, each followed by an underscore and its value. (Don’t bother to test the system by using 10 parameter names that share the first two characters: it will break, and you’re going to have an excessive-compute problem anyway. That is absolutely a YAGNI beyond the scope of our development mandate.)
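
If you want to do that parsing in a reporting script, here is a minimal sketch in Python. The function name parse_batch_dir is hypothetical, and the sketch assumes that parameter values contain no hyphens and that every trailing piece of the form abc_value is a parameter; both assumptions hold for the directories listed above but are not guaranteed by syqada.

```python
import re

# A three-character abbreviation, an underscore, then the value.
PARAM = re.compile(r"([A-Za-z][A-Za-z0-9]{2})_(.+)")

def parse_batch_dir(dirname):
    """Split a batch directory name into (step, task name, parameters)."""
    step, *rest = dirname.split("-")
    found = []
    # Peel parameter pieces off the right-hand end until one fails to match.
    while rest:
        match = PARAM.fullmatch(rest[-1])
        if match is None:
            break
        found.append((match.group(1), match.group(2)))
        rest.pop()
    # Restore the left-to-right order in which the parameters appeared.
    params = dict(reversed(found))
    return step, "-".join(rest), params

print(parse_batch_dir("03-QC-step1-pa1_1-pa2_4"))
# ('03', 'QC-step1', {'pa1': '1', 'pa2': '4'})
print(parse_batch_dir("04-aggregation"))
# ('04', 'aggregation', {})
```

Values come back as strings, since nothing in the directory name says whether 7 was meant as a number or a label.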

Each replicate directory contains a METADATA file that includes replicate information, e.g.:

replicate:
  parameterA = 1
  parameterD = 8
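
Because METADATA carries the full parameter names, it is a friendlier source than the directory name for a reporting step. Here is a minimal sketch of reading the replicate block in Python. The function name read_replicate is hypothetical, and the sketch assumes only what the fragment above shows: a "replicate:" line followed by indented "name = value" lines. The rest of the METADATA format is not described here, so treat this as a starting point.

```python
def read_replicate(text):
    """Return the parameters listed under 'replicate:' as a dict of strings."""
    params = {}
    in_block = False
    for line in text.splitlines():
        if line.strip() == "replicate:":
            in_block = True
            continue
        if in_block:
            # The block ends at the first line that is not an indented
            # "name = value" entry.
            if not line.startswith((" ", "\t")) or "=" not in line:
                break
            name, _, value = line.partition("=")
            params[name.strip()] = value.strip()
    return params

sample = "replicate:\n  parameterA = 1\n  parameterD = 8\n"
print(read_replicate(sample))
# {'parameterA': '1', 'parameterD': '8'}
```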

Also look at control/Replication.config, which now contains the value sets you defined plus a map of the replicate numbers to the permutations of the parameters. This is not as useful right now as it might be if it included the abbreviated names of the parameters. It is unused by syqada, but you might devise a way to take advantage of it in an elaborate aggregation or reporting step.

You can now run:

syqada auto --project REPLICATION_TEST

This simple workflow should have no difficulty running to completion. The gather steps, steps 02 and 04, as well as step 03, which inherits the scatter remaining in step 02, formulate their inputdir using shell brace expansions (strictly speaking, these are not regular expressions) that comprise all the relevant output directories of the previous step. For example, here is a fragment of the job runner for step 04:

#!/bin/bash
...
(echo PA1: ; grep -c PA1 {03-another-replication-pa1_1-pa2_3/output,03-another-replication-pa1_1-pa2_4/output,03-another-replication-pa1_2-pa2_3/output,03-another-replication-pa1_2-pa2_4/output}/*.out) > 04-aggregation/output/REPLICATION_TEST.aggregate
(echo PB4: ; grep -c PB4 {03-another-replication-pa1_1-pa2_3/output,03-another-replication-pa1_1-pa2_4/output,03-another-replication-pa1_2-pa2_3/output,03-another-replication-pa1_2-pa2_4/output}/*.out) >> 04-aggregation/output/REPLICATION_TEST.aggregate
(echo PA1: ; grep -c PA1 {02-partial-aggregation-pa1_1/output,02-partial-aggregation-pa1_2/output}/*.aggregate) >> 04-aggregation/output/REPLICATION_TEST.aggregate
echo Ignore the error code
...

As you can see, it generates some pretty ghastly-long command-lines within the shell script, but it does what you need done, and I, at least, wouldn’t want to write them myself.
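
If you are curious how such an expression can be assembled, here is a minimal sketch in Python. The function name inputdir_expr is hypothetical and this is not syqada's implementation; it only reproduces the shape of the strings in the fragment above. The one subtlety it handles is that bash does not expand a brace group with a single item, so a lone directory is returned without braces.

```python
def inputdir_expr(batch_dirs, subdir="output"):
    """Build a bash brace expansion covering the output of several batches."""
    paths = [f"{d}/{subdir}" for d in batch_dirs]
    if len(paths) == 1:
        # "{one}" is left untouched by bash, so skip the braces.
        return paths[0]
    return "{" + ",".join(paths) + "}"

step02 = ["02-partial-aggregation-pa1_1", "02-partial-aggregation-pa1_2"]
print(inputdir_expr(step02) + "/*.aggregate")
# {02-partial-aggregation-pa1_1/output,02-partial-aggregation-pa1_2/output}/*.aggregate
print(inputdir_expr(["02-partial-aggregation-pa1_1"]) + "/*.aggregate")
# 02-partial-aggregation-pa1_1/output/*.aggregate
```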

Feel free to examine the resulting structure, metadata, and scripts.

Note that each replicate of the QC step knows how to identify its single predecessor. I have no idea what would happen if you aggregated during a QC step. Exercise left to the reader. We will make no attempt to find out until the use case arrives at our door.