CHAP Pipeline

To run a CHESS Analysis Pipeline (CHAP), you will need:

A CHAP configuration file in YAML format
A CHAP command line interface (CLI) executable

Run a CHAP pipeline by executing:

$ CHAP pipeline.yaml

How to run CHAP on the CHESS Linux system with centrally maintained workflow executables is discussed below.

Constructing a `CHAP` configuration file

CHAP configuration files must be in YAML format. The file contains a single YAML document, and that document contains a single top-level structure. The structure contains at least two keys: one of the keys must be config, and all other keys are pipeline names.

Example of a complete CHAP pipeline configuration file:

config:
  root: .
pipeline:
- common.YAMLReader:
    filename: data.yaml
- common.PrintProcessor
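
The top-level layout described above can be sketched in Python. This is only an illustration: it assumes the YAML file has already been parsed into a dictionary, and `split_config` is a hypothetical helper, not part of CHAP.

```python
def split_config(document):
    """Separate the `config` section from the named pipelines.

    `document` is assumed to be the dictionary obtained by parsing a
    CHAP configuration file. This helper is hypothetical, not CHAP API.
    """
    if not isinstance(document, dict):
        raise TypeError("top level of the file must be a single mapping")
    # A missing `config` key is tolerated in this sketch.
    run_config = document.get("config", {})
    # Every key other than `config` is treated as a pipeline name.
    pipelines = {k: v for k, v in document.items() if k != "config"}
    if not pipelines:
        raise ValueError("no pipeline sections found")
    for name, items in pipelines.items():
        if not isinstance(items, list):
            raise TypeError(f"pipeline {name!r} must be a list of items")
    return run_config, pipelines


# The complete example file above, written as its parsed dictionary.
example = {
    "config": {"root": "."},
    "pipeline": [
        {"common.YAMLReader": {"filename": "data.yaml"}},
        "common.PrintProcessor",
    ],
}
run_config, pipelines = split_config(example)
```

Here `run_config` holds the contents of the config section and `pipelines` maps each pipeline name to its list of items.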

The `config` section

The config section contains the values of the instance variables for an instance of CHAP.models.RunConfig. It is technically optional, but it should be included in every pipeline file for reproducibility and provenance. It can also be helpful for applying the same pipeline to many datasets, depending on how your dataset files are organized. The keys you can use in this section are:

Key	Description	Default value
`root`	Path to the working directory	Directory from which `CHAP` was run
`inputdir`	Path to a directory where all `Reader`s will look for files	Same value as `root`
`outputdir`	Path to a directory where all `Writer`s will write files	Same value as `root`
`interactive`	Flag to allow certain optional data / parameter checks that require user interaction to proceed. Not applicable to all pipelines, only to those which contain these optionally-interactive tools.	`false`
`log_level`	Name of a python logging level (not case sensitive)	`info`

Example config section containing all default values:

config:
  root: .
  inputdir: .
  outputdir: .
  interactive: false
  log_level: info
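
How the defaults in the table fall back on one another can be sketched with a small dataclass. `RunConfigSketch` is a hypothetical stand-in written for this illustration; it is not the real CHAP.models.RunConfig, and the use of `"."` for the directory from which CHAP was run is an assumption.

```python
import logging
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunConfigSketch:
    """Hypothetical stand-in that mirrors the documented defaults."""
    root: str = "."  # assumed to mean the directory CHAP was run from
    inputdir: Optional[str] = None
    outputdir: Optional[str] = None
    interactive: bool = False
    log_level: str = "info"

    def __post_init__(self):
        # inputdir and outputdir fall back to the value of root.
        if self.inputdir is None:
            self.inputdir = self.root
        if self.outputdir is None:
            self.outputdir = self.root

    def logging_level(self):
        # Level names are not case sensitive, so normalize before lookup.
        return getattr(logging, self.log_level.upper())


cfg = RunConfigSketch(root="/tmp/work", log_level="Debug")
```

Setting only `root` is enough to redirect both input and output, which is what makes one pipeline file easy to reuse across datasets stored in different directories.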

Pipeline sections

Sections with names that are not config are actual pipelines. A single CHAP configuration file may contain more than one pipeline. Each pipeline must be a list of Readers, Processors, and Writers (PipelineItems) to execute consecutively, together with the instance variables and other parameters that configure each one. To assemble your own pipeline configuration:

Decide which PipelineItems to use and in what order.
For each PipelineItem, refer to the Reference Guide (API documentation) to find out what instance variables it has. The Reference Guide also contains a description of every variable, its expected type, and its default value (for optional variables). Remember to include the instance variables of any object from which the relevant PipelineItem inherits. For example, YAMLReader lists no instance variables of its own, but it inherits from Reader, which has filename, so YAMLReader also has the filename instance variable.
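
The inheritance point can be illustrated with plain Python dataclasses. The two classes below are hypothetical stand-ins that only borrow the names Reader and YAMLReader; they are not the CHAP implementations.

```python
from dataclasses import dataclass, fields


@dataclass
class Reader:
    """Hypothetical stand-in for a base reader with one variable."""
    filename: str


@dataclass
class YAMLReader(Reader):
    """Declares nothing itself, but inherits `filename` from Reader."""


def configurable_keys(item_class):
    # dataclasses.fields() includes fields inherited from base classes.
    return [f.name for f in fields(item_class)]


keys = configurable_keys(YAMLReader)
```

Even though the subclass body is empty, `keys` contains `filename`, so a pipeline entry for the subclass still accepts that key.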

Example: `MapProcessor`

Suppose you want to configure a pipeline that collects all raw data from a CHESS dataset in a NeXus file, and that you already have a valid CHAP.common.models.map.MapConfig object for the dataset saved to a file named map_config.yaml. To create a suitable pipeline file:

Decide on the required PipelineItems. The pipeline will need a Reader that supports YAML files, a Processor that collects MapConfig data in a NeXus structure, and a Writer that supports NeXus files. So, the pipeline configuration looks like this to start:
```
pipeline:
- common.YAMLReader:
    TBD
- common.MapProcessor:
    TBD
- common.NexusWriter:
    TBD
```

Next, replace each TBD with the instance variables listed for that PipelineItem in the Reference Guide:

pipeline:
- common.YAMLReader:
    filename: map_config.yaml
- common.MapProcessor:
    detector_config:
      detectors:
      - id: detector_id
        shape: [0, 0]
        attrs:
          foo: bar
- common.NexusWriter:
    filename: map_data.nxs

`CHAP` CLI usage

To display a description of how to use CHAP from the command line, execute:

$ CHAP --help

to get:

usage: PROG [-h] [-p [PIPELINE ...]] [--regex [{match,search,fullmatch}]]
            [--batch] [--batch-logdir LOGDIR]
            config

positional arguments:
  config                Input configuration file

options:
  -h, --help            show this help message and exit
  -p [PIPELINE ...], --pipeline [PIPELINE ...]
                        Pipeline name(s)
  --regex [{match,search,fullmatch}]
                        Name of Python RegEx function
                        (https://docs.python.org/3/howto/regex.html) to use
                        for matching configured pipeline names against the
                        string provided with the -p / --pipeline option.
  --batch               Enables "batch mode" operation where every sub-
                        pipeline is run in separate parallel processes. Log
                        files for each pipeline process will be created in the
                        directory specified with the `--batch-logdir` option.
  --batch-logdir LOGDIR
                        Destination directory for individual pipeline log
                        files when running multiple pipelines in batch mode.

Option	Description
`--pipeline` or `-p`	When more than one named pipeline configuration is present in a `CHAP` config file, `--pipeline` or `-p` can be used to specify a limited selection of the pipeline(s) from the file to be executed.
`--regex`	This option augments the behavior of `--pipeline` – the difference is that when `--regex` is used, the value of `--pipeline` specifies a regular expression pattern for selecting the names of pipeline(s) to run.
`--batch`	This option augments the behavior of `--pipeline` – the difference is that when `--batch` is used, the specified pipelines are executed individually and in parallel instead of being concatenated and executed as a single pipeline.
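
The three choices accepted by `--regex` correspond to functions of the same names in Python's `re` module, and they differ in how much of a pipeline name the pattern must cover. The sketch below shows that difference; `select_pipelines` is a hypothetical helper, and its default of `match` is chosen only for this illustration, not taken from CHAP.

```python
import re


def select_pipelines(names, pattern, function="match"):
    """Return the names for which the chosen `re` function finds a match.

    Hypothetical helper; `function` is one of the three choices listed
    for --regex in the help text: match, search, or fullmatch.
    """
    if function not in ("match", "search", "fullmatch"):
        raise ValueError(f"unsupported regex function: {function}")
    matcher = getattr(re, function)
    return [name for name in names if matcher(pattern, name)]


names = ["pipeline_1", "pipeline_2", "other"]
# match: the pattern must match at the start of the name.
prefix_hits = select_pipelines(names, "pipeline", "match")
# search: the pattern may match anywhere in the name.
search_hits = select_pipelines(names, "_2", "search")
# fullmatch: the pattern must match the entire name.
full_hits = select_pipelines(names, "pipeline", "fullmatch")
```

With these inputs, `match` selects both `pipeline_1` and `pipeline_2`, `search` selects only `pipeline_2`, and `fullmatch` selects nothing because neither name is exactly `pipeline`.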

Example commands

Suppose pipeline.yaml contains:

config:
  root: .
pipeline_1:
- common.YAMLReader:
    filename: data_1.yaml
- common.PrintProcessor
pipeline_2:
- common.YAMLReader:
    filename: data_2.yaml
- common.PrintProcessor

Command	Behavior
`CHAP pipeline.yaml` or `CHAP pipeline.yaml -p pipeline --regex`	Concatenate `pipeline_1` and `pipeline_2` and execute all items as a single pipeline.
`CHAP pipeline.yaml -p pipeline_1`	Execute `pipeline_1` only.
`CHAP pipeline.yaml --batch` or `CHAP pipeline.yaml -p pipeline --regex --batch`	Execute `pipeline_1` and `pipeline_2` in separate parallel processes, creating a log file for each one: `./CHAP_logs/pipeline_1.log` and `./CHAP_logs/pipeline_2.log`.
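
The difference between the default behavior and `--batch` in the table above comes down to how the selected pipelines are grouped before execution. The sketch below models only that grouping; it does not start any processes or write log files, and `plan_runs` is a hypothetical helper, not part of CHAP.

```python
def plan_runs(pipelines, batch=False):
    """Group pipeline items into the runs that would be executed.

    Hypothetical helper. `pipelines` maps each selected pipeline name
    to its list of items. Returns a list of (label, items) pairs.
    """
    if batch:
        # Batch mode: every pipeline becomes its own independent run.
        return [(name, list(items)) for name, items in pipelines.items()]
    # Default: all items are concatenated, in file order, into one run.
    combined = [item for items in pipelines.values() for item in items]
    return [("+".join(pipelines), combined)]


pipelines = {
    "pipeline_1": [{"common.YAMLReader": {"filename": "data_1.yaml"}},
                   "common.PrintProcessor"],
    "pipeline_2": [{"common.YAMLReader": {"filename": "data_2.yaml"}},
                   "common.PrintProcessor"],
}
single = plan_runs(pipelines)
separate = plan_runs(pipelines, batch=True)
```

In this model, `single` contains one run of four items, while `separate` contains two runs of two items each, one per named pipeline.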

Python executables for `CHAP` on the CHESS Linux system

Running CHAP on the CHESS Linux system does not require users to create their own Conda environment or CHAP executables. Instead, CHESS maintains regularly updated CHAP executables for each of the maintained workflows in the shared software releases directory for CHESS: /nfs/chess/sw/CHESS-software-releases. Specifically, production and development versions of the CHAP executables can be found in /nfs/chess/sw/CHESS-software-releases/prod and /nfs/chess/sw/CHESS-software-releases/dev, respectively.

Production version executables are updated each time a new tagged release is created for the main branch of the CHAP GitHub repository. Links to executables for the latest production version can be found in /nfs/chess/sw/CHESS-software-releases/prod; links to older releases can be found in subdirectories named by release version number. Release notes can be found here. The CHAP Reference Guide (API documentation) is also updated automatically with each new tagged release.

Development version executables are updated each time a new commit is pushed to the dev branch of the CHAP GitHub repository. Links to executables for the latest development version can be found in /nfs/chess/sw/CHESS-software-releases/dev.

For example, to run the Tomo workflow using the latest production release version, execute:

$ /nfs/chess/sw/CHESS-software-releases/prod/CHAP_tomo pipeline.yaml

or to run the EDD workflow using the latest development release version, execute:

$ /nfs/chess/sw/CHESS-software-releases/dev/CHAP_edd pipeline.yaml

You may find it convenient to add an alias to your ~/.bashrc or ~/.bash_aliases, for example for the CHAP Tomography workflow production release:

alias CHAP_tomo_prod='/nfs/chess/sw/CHESS-software-releases/prod/CHAP_tomo'

after which you can run the Tomo workflow using the latest production release version by simply executing:

$ CHAP_tomo_prod pipeline.yaml

Python environments for `CHAP` on any Linux system

Developing a user PipelineItem for CHAP, or running CHAP on a Linux system other than the CHESS farm, requires users to create their own Conda environment by taking the following steps:

Create a base Conda environment and clone the CHAP repository according to steps 1 and 2 of the Conda installation instructions.
Create a Conda environment suitable to your own PipelineItem or create a Conda environment for each workflow that you want to run.

For example, to create the SAXSWAXS Conda environment and run a SAXSWAXS workflow:

Activate your base Conda environment:

$ source <path_to_CHAP_clone_dir>/bin/activate

Create a Conda environment inside your base environment with:

(base) $ mamba env create -f <path_to_CHAP_clone_dir>/CHAP/saxswaxs/environment.yml

Activate the CHAP_saxswaxs environment:
```
(base) $ conda activate CHAP_saxswaxs
```
Try running:
```
(CHAP_saxswaxs) $ CHAP --help
```
to confirm that the package and the environment were installed correctly.
Navigate to your work directory.
Create the required CHAP pipeline file for the workflow (see above) and any additional workflow specific input files.
Run the workflow using your own CHAP_saxswaxs executable:

   (CHAP_saxswaxs) $ CHAP pipeline.yaml
