.. _syqada_tutorial_features:

Tutorial on Special Features
============================

This tutorial is designed to illustrate several features that have been
found to be useful. Their use does significantly complicate the life of
the analyst using them, so you may wish to adopt them with caution. In
particular, using replication is drinking absinthe with Baudelaire, and
protocol nesting is a Russian fairy tale. Combine them, and you're in a
story by E.T.A. Hoffmann.

If you thought :ref:`simple_stupid_workflow` was simple and stupid, you
have another horror waiting for you here.

Running ``syqada tutorial`` and selecting the Features tutorial will
populate your tutorial directory with a control directory and the
protocol, config, and samples files.

Start by examining :code:`control/Features.protocol`. You'll see that it
is version 1.4, which allows the use of all the "advanced" features in
this tutorial (versioning of protocols has persisted since the
transition from SyQADA-0 to SyQADA-1.0, although empirical evidence
suggests that it's a Bridge Too Far, so it should probably be
eliminated).
If you execute ``syqada describe`` in your tutorial directory, syqada
assumes that you are invoking the protocol found in the control
directory, so you will see something like this::

    Using these additional valid queues: 'testing'
    Protocol control/Features.protocol
        Description: Protocol for tutorial demonstration of advanced features: replication, iteration, nesting, complexity, quality assurance
    Preamble
        Protocol nesting is True
        Valid queues: testing
        Replicands: ['parameterA', 'parameterB', 'parameterC', 'parameterD']
    Tasks
        01-simple-replication
            Description: (Obviously,) scatter is parameterA,parameterD
        02-partial-aggregation
            Description: parameterD is gathered; parameterA is still scattered
        03-another-replication
            Description: parameterB is added to scatter with parameterA
        03-QC-step1
            Description: scatter is still parameterA,parameterB
        03-QC-step-repeat
            Description: This tests passing inputs per-replication, a common condition of replicating an existing protocol
        04-aggregation
            Description: Both scattered parameters are explicitly gathered
        0501-spin-date
            Description: Iterate copies of a file containing iteration count and date of execution. 100 copies by default, should be overwritten as 10 by the tutorial protocol.
        0502-stats
            Description: Summarize the output of the spin-date task.
        06-summary
            Description: Demonstrate use of nested protocol output. We can use a nested task in added_input.

The preamble of the protocol includes two comment lines that provide a
description. The next three lines define the parameters to be
replicated. These lines must appear in the protocol before the first
TASKDEF.

The protocol describes five tasks plus two inherited from a nested
protocol. The template of the first of them is::

    echo PA{parameterA}, PD{parameterD} > {output_prefix}.out

which, of course, just creates an output file for each sample listing
the parameters for that replication.

The third one simply greps a particular value from the first output set.
Note the :code:`echo` command, which causes the script to succeed
whether grep finds anything or not::

    grep PB4 {inputdir}/{sample}.out > {output_prefix}.out
    echo ignore the error code

The config file contains a dummy source directory, because this protocol
does not require data. The samples file identifies two samples.

When you run::

    syqada auto --init

you should see a series of directories, with parameterA and parameterD
(abbreviated ``pa1`` and ``pa4``) varying over two values each::

    01-simple-replication-pa1_1-pa4_16@~dummy_
    01-simple-replication-pa1_1-pa4_7
    01-simple-replication-pa1_2-pa4_16@~dummy_
    01-simple-replication-pa1_2-pa4_7

Note the task names that include the ``@`` and ``_`` characters. These
are substitutions for the colon and the slash, made to prevent clashes
with Unix conventions for hostname specification and the path separator.

.. _replication_structure:

Replication Directory Structure
-------------------------------

For the example replication, with two parameters varying over two values
in the first and third steps, one parameter varying over two values in
the second (partial aggregation) step, and a final aggregation step,
these batch directories will have been created when the whole protocol
is complete::

    01-simple-replication-pa1_1-pa4_16@~dummy_
    01-simple-replication-pa1_1-pa4_7
    01-simple-replication-pa1_2-pa4_16@~dummy_
    01-simple-replication-pa1_2-pa4_7
    02-partial-aggregation-pa1_1
    02-partial-aggregation-pa1_2
    03-another-replication-pa1_1-pa2_3
    03-another-replication-pa1_1-pa2_4
    03-another-replication-pa1_2-pa2_3
    03-another-replication-pa1_2-pa2_4
    03-QC-step1-pa1_1-pa2_3
    03-QC-step1-pa1_1-pa2_4
    03-QC-step1-pa1_2-pa2_3
    03-QC-step1-pa1_2-pa2_4
    04-aggregation

After that, Features.protocol uses a PROTOCOLREF to include a two-step
nested protocol, which demonstrates the use of iteration, and then a
final step that demonstrates a reference to one of the nested tasks::

    0501-spin-date
    0502-stats
    06-summary

The name of a replication directory may be parsed to identify the values of parameters
used in each replication. The parameter names are abbreviated to three
characters, made unique by numbering the third character if necessary
(don't bother to test the system by using 10 parameter names that share
the first two characters; it will break, and you're going to have an
excessive-compute problem anyway. That is absolutely a YAGNI
(:ref:`glossary`) beyond the scope of our development mandate).

Each replicate directory contains a METADATA file that includes
replicate information, e.g.::

    replicate:
        parameterA = 1
        parameterD = 8

Also look at ``control/Features.replication``, which now contains the
value sets you defined plus a map of the replicate numbers to the
permutations of the parameters. This is not as useful right now as it
might be if it included the abbreviated names of the parameters. It is
unused by syqada, but you might devise a way to take advantage of it in
an elaborate aggregation or reporting step.

You can now run::

    syqada auto --project Features

This simple workflow should have no difficulty running to completion
(but, pending a bug fix, it will: you will need to respond to prompts
with a "y" to get through the replicates in the first task, possibly
after repeating ``syqada auto`` if it stops).

The gather steps, steps 02 and 04, as well as step 03, which inherits
the scatter remaining in step 02, use shell brace expansions that list
all the output directories of the previous step to formulate their
:code:`inputdir`. For example, here is a fragment of the job runner for
step 04::

    #!/bin/bash
    ...
    (echo PA1: ; grep -c PA1 {03-another-replication-pa1_1-pa2_3/output,03-another-replication-pa1_1-pa2_4/output,03-another-replication-pa1_2-pa2_3/output,03-another-replication-pa1_2-pa2_4/output}/*.out) > 04-aggregation/output/Features.aggregate
    (echo PB4: ; grep -c PB4 {03-another-replication-pa1_1-pa2_3/output,03-another-replication-pa1_1-pa2_4/output,03-another-replication-pa1_2-pa2_3/output,03-another-replication-pa1_2-pa2_4/output}/*.out) >> 04-aggregation/output/Features.aggregate
    (echo PA1: ; grep -c PA1 {02-partial-aggregation-pa1_1/output,02-partial-aggregation-pa1_2/output}/*.aggregate) >> 04-aggregation/output/Features.aggregate
    echo Ignore the error code
    ...

As you can see, it generates some pretty ghastly-long command lines
within the shell script, but it does what you need done, and I, at
least, wouldn't want to write them myself.

Feel free to examine the resulting structure, metadata, and scripts.
Note that each replicate of the QC step knows how to identify its single
predecessor. I have no idea what would happen if you aggregated during a
QC step. Exercise left to the reader. We will make no attempt to find
out until the use case arrives at our door.
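
If you are curious how a gather expression like the ones in the step 04
fragment could be put together, here is a minimal Python sketch. It is
not SyQADA's own code: ``gather_inputdir`` and its ``subdir`` parameter
are hypothetical names invented for this illustration, and the sketch
assumes that every batch directory keeps its results in an ``output``
subdirectory, as the fragment suggests.

```python
def gather_inputdir(batch_dirs, subdir="output"):
    """Return a bash brace expansion naming `subdir` in each batch directory.

    Hypothetical helper for illustration only; not part of SyQADA.
    """
    paths = [f"{d}/{subdir}" for d in batch_dirs]
    if not paths:
        raise ValueError("need at least one batch directory")
    if len(paths) == 1:
        # bash does not expand a brace group with a single element,
        # so return the lone path without braces.
        return paths[0]
    return "{" + ",".join(paths) + "}"


# The four step-03 replicates from the tutorial (parameterA x parameterB).
step03 = [
    f"03-another-replication-pa1_{a}-pa2_{b}"
    for a in (1, 2)
    for b in (3, 4)
]

# Prints a grep command over all four output directories.
print(f"grep -c PA1 {gather_inputdir(step03)}/*.out")
```

The sketch does no quoting, so a directory name containing a comma or
whitespace would break the expansion; the abbreviated replicate names
shown in this tutorial contain neither.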