Tutorial on Special Features

This tutorial is designed to illustrate several features that have proven useful. Their use does significantly complicate the life of the analyst using them, so you may wish to adopt them with caution. In particular, using replication is drinking absinthe with Baudelaire, and protocol nesting is a Russian fairy tale. Combine them, and you’re in a story by E.T.A. Hoffmann.

If you thought A Simple, Stupid Workflow was simple and stupid, you have another horror waiting for you here.

Running syqada tutorial and selecting the Features tutorial will populate your tutorial directory with a control directory and the protocol, config, and samples files.

Start by examining control/Features.protocol. You’ll see that it is version 1.4, which allows the use of all the “advanced” features in this tutorial (versioning of protocols has persisted since the transition from SyQADA-0 to SyQADA-1.0, although empirical evidence suggests that it’s a Bridge Too Far, so it should probably be eliminated).

If you execute

syqada describe

in your tutorial directory, syqada assumes that you are invoking the protocol found in the control directory, so you will see something like this:

Using these additional valid queues: 'testing'
Protocol control/Features.protocol
Description: Protocol for tutorial demonstration of advanced features:
  replication, iteration, nesting, complexity, quality assurance
Preamble
  Protocol nesting is True
  Valid queues: testing
  Replicands: ['parameterA', 'parameterB', 'parameterC', 'parameterD']
Tasks
  01-simple-replication
      Description: (Obviously,) scatter is parameterA,parameterD
  02-partial-aggregation
      Description: parameterD is gathered; parameterA is still
      scattered
  03-another-replication
      Description: parameterB is added to scatter with parameterA
  03-QC-step1
      Description: scatter is still parameterA,parameterB
  03-QC-step-repeat
      Description: This tests passing inputs per-replication, a common
      condition of replicating an existing protocol
  04-aggregation
      Description: Both scattered parameters are explicitly gathered
  0501-spin-date
      Description: Iterate copies of a file containing iteration count
      and date of execution.  100 copies by default, should be
      overwritten as 10 by the tutorial protocol.
  0502-stats
      Description: Summarize the output of the spin-date task.
  06-summary
      Description: Demonstrate use of nested protocol output.  We can
      use a nested task in added_input.

The preamble of the protocol includes two comment lines that provide a description. The next three lines define the parameters to be replicated. These lines must appear in the protocol before the first TASKDEF. The protocol describes five tasks plus two inherited from a nested protocol. The template of the first of them is:

echo PA{parameterA}, PD{parameterD} > {output_prefix}.out

which, of course, just creates an output file for each sample listing the parameters for that replication. The third one simply greps a particular value from the first output set. Note the echo command, which causes the script to succeed whether grep finds anything or not:

grep PB4 {inputdir}/{sample}.out > {output_prefix}.out
echo ignore the error code
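
To see what such a template turns into, the placeholder substitution in the first template can be imitated with Python's str.format. This is only an illustration: how syqada actually fills its templates is not shown in this tutorial, and the values below (including the output_prefix path and the function name fill_template) are made up for the example.

```python
# The first task's template, copied from the tutorial text.
TEMPLATE = "echo PA{parameterA}, PD{parameterD} > {output_prefix}.out"


def fill_template(template, **values):
    """Substitute {name} placeholders; a stand-in for whatever syqada does."""
    return template.format(**values)


# Hypothetical values for one replicate and one sample.
command = fill_template(
    TEMPLATE,
    parameterA="1",
    parameterD="7",
    output_prefix="output/sample1",
)
print(command)
```

Each replicate of each sample would get its own filled-in command, differing only in the substituted values.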

The config file contains a dummy source directory, because this protocol does not require data. The samples file identifies two samples.

When you run:

syqada auto --init

you should see a series of directories, with parameterA and parameterD (abbreviated pa1 and pa4) each varying over two values:

01-simple-replication-pa1_1-pa4_16@~dummy_
01-simple-replication-pa1_1-pa4_7
01-simple-replication-pa1_2-pa4_16@~dummy_
01-simple-replication-pa1_2-pa4_7

Note the task names that include the @ and _ characters, which are substituted for the colon and the slash to prevent clashes with Unix conventions for hostname specification and the path separator.
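
The character substitution can be sketched in a few lines of Python. The function name directory_safe is hypothetical, and the input value "16:~dummy/" is only a guess at what the replicand looked like before substitution, inferred from the directory name above; syqada's own implementation may differ.

```python
# Assumed mapping, taken from the tutorial text: colon -> @, slash -> _.
_UNSAFE = str.maketrans({":": "@", "/": "_"})


def directory_safe(value):
    """Return a replicand value with path-hostile characters replaced."""
    return str(value).translate(_UNSAFE)


# Guessed original value for the pa4 replicate seen in the listing.
print(directory_safe("16:~dummy/"))
print(directory_safe(7))
```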

Replication Directory Structure

For the example replication, with two parameters varying over two values in the first and third steps, one parameter varying over two values in the second partial aggregation step, and the aggregation step, these batch directories will be created when the whole protocol is complete:

01-simple-replication-pa1_1-pa4_16@~dummy_
01-simple-replication-pa1_1-pa4_7
01-simple-replication-pa1_2-pa4_16@~dummy_
01-simple-replication-pa1_2-pa4_7
02-partial-aggregation-pa1_1
02-partial-aggregation-pa1_2
03-another-replication-pa1_1-pa2_3
03-another-replication-pa1_1-pa2_4
03-another-replication-pa1_2-pa2_3
03-another-replication-pa1_2-pa2_4
03-QC-step1-pa1_1-pa2_3
03-QC-step1-pa1_1-pa2_4
03-QC-step1-pa1_2-pa2_3
03-QC-step1-pa1_2-pa2_4
04-aggregation
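
The listing above is the cross product of the values of whichever parameters are still scattered at each step. A minimal sketch of that bookkeeping follows, assuming the abbreviated names and values seen in the directory names; the function replicate_dirs is hypothetical and is not part of syqada.

```python
from itertools import product


def replicate_dirs(task, scatter):
    """List batch directory names for a task.

    scatter maps an abbreviated parameter name to its list of values.
    An empty mapping means everything has been gathered.
    """
    names = sorted(scatter)
    dirs = []
    for combo in product(*(scatter[n] for n in names)):
        suffix = "".join(f"-{n}_{v}" for n, v in zip(names, combo))
        dirs.append(task + suffix)
    return dirs


pa1 = ["1", "2"]
pa2 = ["3", "4"]
for d in replicate_dirs("03-another-replication", {"pa1": pa1, "pa2": pa2}):
    print(d)
print(replicate_dirs("02-partial-aggregation", {"pa1": pa1}))
print(replicate_dirs("04-aggregation", {}))
```

With no scattered parameters the product has exactly one (empty) combination, which is why the fully gathered step yields a single directory with no suffix.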

After these, Features.protocol uses a PROTOCOLREF to include a two-step nested protocol, which demonstrates the use of iteration, followed by a final step that demonstrates referencing one of the nested tasks:

0501-spin-date
0502-stats
06-summary

The name of a replication directory may be parsed to identify the values of parameters used in each replication. The parameter names are abbreviated to three characters, made unique by numbering the third character if necessary. (Don’t bother to test the system by using 10 parameter names that share the first two characters: it will break, and you’re going to have an excessive-compute problem anyway. That is absolutely a YAGNI (Glossary) beyond the scope of our development mandate.)
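
As a rough sketch of such parsing, the function below peels trailing segments of the form abbreviation_value off a directory name. It assumes that abbreviations are exactly three alphanumeric characters and that neither task-name words nor values contain a hyphen in a way that would confuse the split; the name parse_replicate_dir is hypothetical and nothing here is taken from syqada's source.

```python
import re

# Assumed shape of one replicate segment: 3-character abbreviation, "_", value.
_SEGMENT = re.compile(r"^([A-Za-z0-9]{3})_(.+)$")


def parse_replicate_dir(name):
    """Split a batch directory name into (task name, {abbreviation: value})."""
    parts = name.split("-")
    params = {}
    # Walk backwards so that only trailing segments count as parameters.
    while parts:
        match = _SEGMENT.match(parts[-1])
        if not match:
            break
        params[match.group(1)] = match.group(2)
        parts.pop()
    return "-".join(parts), params


print(parse_replicate_dir("03-QC-step1-pa1_2-pa2_4"))
print(parse_replicate_dir("01-simple-replication-pa1_1-pa4_16@~dummy_"))
print(parse_replicate_dir("04-aggregation"))
```

The values come back in their directory-safe form (for example with @ and _ substitutions still in place), so the METADATA file remains the more reliable source for the original values.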

Each replicate directory contains a METADATA file that includes replicate information, e.g.:

replicate:
  parameterA = 1
  parameterD = 8
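
If you want to read these values in a reporting script, something like the following may do. It assumes the replicate block looks exactly like the excerpt above (a "replicate:" header followed by indented "name = value" lines) and makes no claim about the rest of the METADATA format; read_replicate_block is a hypothetical helper name.

```python
def read_replicate_block(text):
    """Return the 'name = value' pairs listed under a 'replicate:' header."""
    values = {}
    in_block = False
    for line in text.splitlines():
        if line.strip() == "replicate:":
            in_block = True
            continue
        if in_block:
            # Assumption: the block ends at the first line that is not an
            # indented 'name = value' entry.
            if not line.startswith((" ", "\t")) or "=" not in line:
                break
            name, _, value = line.partition("=")
            values[name.strip()] = value.strip()
    return values


example = """replicate:
  parameterA = 1
  parameterD = 8
"""
print(read_replicate_block(example))
```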

Also look at control/Features.replication, which now contains the value sets you defined plus a map of the replicate numbers to the permutations of the parameters. This is not as useful right now as it might be if it included the abbreviated names of the parameters. It is unused by syqada, but you might devise a way to take advantage of it in an elaborate aggregation or reporting step.

You can now run:

syqada auto --project Features

This simple workflow should run to completion without difficulty. However, pending a bug fix, it will not do so unattended: you will need to respond to prompts with a “y” to get through the replicates in the first task, possibly after repeating syqada auto if it stops. The gather steps (02 and 04), as well as step 03, which inherits the scatter remaining in step 02, use shell brace expansions that list all the output directories of the previous step to formulate their inputdir. For example, here is a fragment of the job runner for step 04:

#!/bin/bash
...
(echo PA1: ; grep -c PA1 {03-another-replication-pa1_1-pa2_3/output,03-another-replication-pa1_1-pa2_4/output,03-another-replication-pa1_2-pa2_3/output,03-another-replication-pa1_2-pa2_4/output}/*.out) > 04-aggregation/output/Features.aggregate
(echo PB4: ; grep -c PB4 {03-another-replication-pa1_1-pa2_3/output,03-another-replication-pa1_1-pa2_4/output,03-another-replication-pa1_2-pa2_3/output,03-another-replication-pa1_2-pa2_4/output}/*.out) >> 04-aggregation/output/Features.aggregate
(echo PA1: ; grep -c PA1 {02-partial-aggregation-pa1_1/output,02-partial-aggregation-pa1_2/output}/*.aggregate) >> 04-aggregation/output/Features.aggregate
echo Ignore the error code
...

As you can see, it generates some pretty ghastly-long command-lines within the shell script, but it does what you need done, and I, at least, wouldn’t want to write them myself.
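
The brace-delimited directory lists in that fragment can be assembled mechanically. The sketch below shows one way to do it, assuming each batch directory keeps its results in an output subdirectory as in the fragment; gathered_inputdir is a hypothetical name, and syqada may build the string differently.

```python
def gathered_inputdir(batch_dirs, subdir="output"):
    """Build a bash brace-expansion string covering several batch directories."""
    paths = [f"{d}/{subdir}" for d in batch_dirs]
    if len(paths) == 1:
        # bash does not expand braces around a single item, so leave it bare.
        return paths[0]
    return "{" + ",".join(paths) + "}"


step02 = ["02-partial-aggregation-pa1_1", "02-partial-aggregation-pa1_2"]
print("grep -c PA1 " + gathered_inputdir(step02) + "/*.aggregate")
```

When bash runs the resulting command, it expands the braces into one path per directory before applying the glob, so the single grep sees the files from every replicate.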

Feel free to examine the resulting structure, metadata, and scripts.

Note that each replicate of the QC step knows how to identify its single predecessor. I have no idea what would happen if you aggregated during a QC step. Exercise left to the reader. We will make no attempt to find out until the use case arrives at our door.