.. _syqada_manage:

###################
syqada manage
###################

>>> syqada manage *batchroot*

runs batch_tool, which shows the current state of the batch in the given task
directory, including any batch management problems. A batch management problem
usually arises when SyQADA terminates prematurely while it is managing
the execution of a batch. In that case, the progress of jobs still in the PBS
queue cannot be recorded, and repair work becomes necessary.
An explanation of what `syqada manage` does is found in :ref:`manage_method`.

If there are batch management problems, they may be shown in detail
by adding the --detail option with one or more of the additional arguments,
*done*, *failed*, or *output*.
[I haven't used this in a long time, and there are no unit/integration
tests for it, so it may behave oddly.]

If batch management problems do exist, they can
be fixed by adding the --fix option with one or more of the same additional
arguments.

Note that jobs left in running (or queued or stuck) that are terminated forcibly
after the SyQADA batch manager has quit running will not be recognized by *syqada manage*
and must be moved by hand. Since these jobs need to be re-run, the tactic I use
is::

  mv *batchroot*/running/* *batchroot*/error
  syqada batch *batchroot* --step repend

Moving them to *batchroot*/error and then using --step repend, instead of simply
moving them to *batchroot*/pending, allows SyQADA to trim off the old process id,
which would otherwise confuse SyQADA upon restart.

  :ref:`syqada_batch` XXX --step rerun

is supposed to address this, but a bug frequently crops up, so the
method above is more reliable.

Examples of output
------------------

A batch that has completed successfully will produce results that look like::

  > syqada manage  task-directory/
  0.9.9

  checking control directories... ............................................
  checking logs... ...........................................................
  syqada-0.9.9: task 0102-varscan
  jobs 88, queues  pending 0,  running 0,  done 88,  error 0
                   ,           ,  begun 88,  done 88,  failed 0, outputs 88
  88 of 88 required jobs completed.
  batch completed

Obviously, there are other conditions. A batch in progress will produce results that look like::

  > syqada manage task-directory/
  0.9.9

  checking control directories... ............................................
  checking logs... .......
  syqada-0.9.9: task 0802-varscan
  jobs 88, queues  pending 81,  running 0,  done 7,  error 0
                 ,           ,  begun 7,  done 7,  failed 0, outputs 7
  batch can resume

A batch that has failed will produce results that look like::

  > syqada manage 01-phase-samples
  0.9.8.3

  Checking control directories... .........................................
  .........................................................................
  Checking logs... .......................................
  syqada-0.9.8.3: Task 01-phase-samples
  Jobs 2948, Queues  PENDING 2615,  RUNNING 0,  DONE 0,  ERROR 333
                   ,           ,  begun 333,  done 0,  failed 333, outputs 0
  Batch in error

In certain cases, because of the timing of SyQADA managing the completing jobs, you may
see a message saying "batch needs curation" with a description of discrepancies. This is
harmless as long as *syqada* (*auto* or *batch*) is still running. If SyQADA has
terminated and you see the message "batch needs curation", you can run with the
*--details* parameter to get more information. For example::

  >  syqada manage task-directory/ --details done

This shows exactly which jobs marked done have not been properly managed. *done* is
not the only option; other options to the detail parameter are explained on the
*batch_tool.py* page.

To curate the batch *after* *syqada batch* has terminated, you
can run with the *--fix* parameter. For example::

  >  syqada manage task-directory/ --fix done failed output

This will curate all jobs that:

* were still in state running but had indicated that they had completed (done)
* were still in state running but had indicated that they had had an error (failed)
* were in state done but did not have the same number of outputs as other completed jobs (output)

.. _manage_method:

#######################
How syqada manage Works
#######################

`syqada manage` does the following things:

* Determines from the task definition how many jobs it should expect for this task.
* Tabulates and counts the contents of the PENDING, RUNNING, ERROR, and DONE directories (these should have only job scripts in them).
* Tabulates and counts the contents of the LOGS directory, expecting one .begun, one .out, and one .err file, and either a .done file or a .failed file for each job.
* If an error state is detected, the error classification routine below is invoked.
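
The counting steps above can be sketched in a few lines of Python. This is only an
illustration, not the batch_tool code: the function name ``tabulate`` is hypothetical,
and the lowercase directory names (``pending``, ``running``, ``error``, ``done``,
``logs``) under *batchroot* are an assumption about the layout.

.. code-block:: python

   from collections import Counter
   from pathlib import Path

   # Assumed directory names under batchroot; adjust to the real layout.
   QUEUES = ("pending", "running", "error", "done")
   LOG_SUFFIXES = (".begun", ".out", ".err", ".done", ".failed")


   def tabulate(batchroot):
       """Count job scripts in each queue and log files by suffix."""
       root = Path(batchroot)
       queues = {}
       for name in QUEUES:
           directory = root / name
           if directory.is_dir():
               queues[name] = sum(1 for p in directory.iterdir() if p.is_file())
           else:
               queues[name] = 0

       logs = Counter()
       logdir = root / "logs"
       if logdir.is_dir():
           for path in logdir.iterdir():
               # Only the five log suffixes named above are counted.
               if path.suffix in LOG_SUFFIXES:
                   logs[path.suffix] += 1
       return queues, logs

Comparing the ``done`` queue count with the ``.done`` log count gives the kind of
discrepancy that *syqada manage* reports.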

These conditions must be met to declare a task complete:

* All its job scripts must reside in the DONE directory.
* There must be a .done file in the LOGS for each run-suffix.
* There must be at least one file in the output directory matching the
  run-suffix of each script.
* All jobs must have the same number of outputs in the output directory.

As a by-product of the counting method, syqada is likely to have an ungracious
response if any of the log files is missing.
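
As a rough sketch, the completion conditions can be expressed as a single check over
already-collected data. The function ``task_complete`` and its arguments are
hypothetical. The sketch assumes that job names, ``.done`` log names, and per-job
output counts have already been gathered, and it does not model how SyQADA matches
run-suffixes.

.. code-block:: python

   def task_complete(expected_jobs, done_scripts, done_logs, outputs_per_job):
       """Return (complete, problems) for one task.

       expected_jobs   -- number of jobs the task definition calls for
       done_scripts    -- job names whose scripts are in the done directory
       done_logs       -- job names that have a .done file in the logs
       outputs_per_job -- dict mapping job name to its number of output files
       """
       done_scripts = set(done_scripts)
       problems = []

       # All job scripts must reside in the done directory.
       if len(done_scripts) != expected_jobs:
           problems.append(
               f"{len(done_scripts)} of {expected_jobs} job scripts are in done")

       # Each job needs a .done log file.
       missing = sorted(done_scripts - set(done_logs))
       if missing:
           problems.append(f"no .done log for: {', '.join(missing)}")

       # Each job needs at least one output, and all counts must agree.
       counts = {job: outputs_per_job.get(job, 0) for job in done_scripts}
       empty = sorted(job for job, n in counts.items() if n == 0)
       if empty:
           problems.append(f"no outputs for: {', '.join(empty)}")
       if len(set(counts.values())) > 1:
           problems.append("jobs differ in number of outputs")

       return (not problems, problems)

A task is reported complete only when the returned list of problems is empty.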

Error classification
--------------------

SyQADA error classification begins with the reasonable assumption
that error outputs with the same number of lines probably have
the same cause and likely differ only by sample name. It verifies
this by comparing all error outputs (the Python *set* object makes this
trivial) and counting the unique sets of output, both as generated
and with the sample names removed. It then categorizes the results,
describes them, and selects an instance of each class of error
message for display.
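
The idea can be illustrated with a minimal sketch. It is not the SyQADA
implementation: the function ``classify_errors``, the ``<SAMPLE>`` placeholder, and
the assumption that each error output is keyed by its sample name are all
hypothetical.

.. code-block:: python

   from collections import defaultdict


   def classify_errors(error_outputs):
       """Group error outputs into classes.

       error_outputs maps a sample name to the text of its error output.
       Returns one report entry per class of error message.
       """
       # First pass: outputs with the same number of lines are presumed related.
       by_length = defaultdict(dict)
       for sample, text in error_outputs.items():
           by_length[len(text.splitlines())][sample] = text

       report = []
       for n_lines, group in sorted(by_length.items()):
           # Unique outputs exactly as generated.
           unique_raw = len(set(group.values()))
           # Unique outputs once the sample name is masked out.
           classes = defaultdict(list)
           for sample, text in group.items():
               classes[text.replace(sample, "<SAMPLE>")].append(sample)
           for message, samples in classes.items():
               report.append({
                   "lines": n_lines,
                   "unique_raw": unique_raw,
                   "samples": sorted(samples),
                   "example": message,  # one instance to display
               })
       return report

When many raw outputs collapse into a single class after masking, the errors differ
only by sample name and most likely share one cause.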

This has proved immensely useful because of the time it saves in error resolution, often
as much as the rest of syqada put together.