Quickstart

Given a VCF, and BAM files for the samples in that VCF, hairpin2 returns a VCF with variants flagged in the FILTER column: ADF if a variant has an anomalous distribution indicating that it is likely to be artefactual, ALF if the relevant reads have a lower median alignment score per base than a specified threshold, DVF if the variant appears to be the result of PCR error, and LQF if the variant is largely supported by low quality reads.

Installation

The easiest end-user approach is to install into a virtual environment:

python -m venv .env
source .env/bin/activate
pip install .
hairpin2 --help

Usage

The recommended usage is to provide a config of flag parameters along with the VCF in question and the relevant alignment(s) (.sam/.bam/.cram), like so:

vcf_sample_name="TUMOUR"
aln="aln.cram"
hairpin2 \
  -c myconfig.toml \
  -m "{\"$vcf_sample_name\":\"$aln\"}" \
  variants.vcf \
  "$aln" > output.vcf
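One shell detail is worth spelling out: variables are not expanded inside single quotes, so a mapping that interpolates shell variables cannot simply be wrapped in single quotes. As a minimal sketch, assuming the sample name and alignment path contain no characters that need JSON escaping, printf can assemble the string (the variable name name_mapping is only illustrative):

```shell
# Sample name and alignment path, mirroring the usage example.
vcf_sample_name="TUMOUR"
aln="aln.cram"

# The format string is single-quoted so its inner double quotes survive;
# the variables are passed as printf arguments and are therefore expanded.
# Assumption: neither value contains quotes or backslashes.
name_mapping=$(printf '{"%s":"%s"}' "$vcf_sample_name" "$aln")

echo "$name_mapping"   # prints {"TUMOUR":"aln.cram"}
```

The result can then be passed to hairpin2 as -m "$name_mapping".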

A config of default parameters is provided in example-configs/default-params.toml. Both TOML and JSON based configs are supported.

The default parameters provided are those found to be appropriate on a series of LCMB data. Your use case, data, and opinions may differ - hairpin2 is extensively customisable via the config, and descriptions of the parameters can be found in the Processes and Parameters section of the guide.
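For a one-off adjustment, a single value can be overridden with the --set option described under Command Line Interface, rather than editing the config file. The sketch below only assembles and prints such a command so it can be inspected. The parameter path and value are taken from the --set help text, and the file names from the usage example; whether 0.6 suits your data is not implied.

```shell
# Assemble the argument list. --set must come after --config,
# as stated in the --set help text.
set -- hairpin2 \
  --config myconfig.toml \
  --set params.DVF.read_loss_threshold 0.6 \
  variants.vcf aln.cram

# Print the command for inspection rather than running it here.
cmd="$*"
echo "$cmd"
```

To actually run the assembled command, "$@" > output.vcf would execute it and capture the flagged VCF from stdout.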

See the guide for a complete walkthrough, and the interface section below for a description of the command line options.

Command Line Interface

Usage: hairpin2 [-h, --help] [OPTIONS] VCF ALIGNMENTS...

  read-aware artefactual variant flagging algorithms. Flag variants in VCF
  using statistics calculated from supporting reads found in ALIGNMENTS and
  emit the flagged VCF to stdout.

Options:
  -v, --version                   Show the version and exit.
  -c, --config FILEPATH           path to config TOML/s or JSON/s from which
                                  processes and execution will be configured.
                                  May be provided multiple times; individual
                                  configs can be provided for each top level
                                  key (params, exec).
  -o, --output-config FILEPATH    log run configuration back to a new JSON
                                  file.
  --set <TEXT TEXT>...            Override values in the supplied config. Uses
                                  dot paths for the key, and JSON format for
                                  the value, e.g., --set
                                  params.DVF.read_loss_threshold 0.6. Must be
                                  provided after --config. May be provided
                                  multiple times.
  -m, --name-mapping JSON_STRING | FILEPATH
                                  If sample names in VCF differ from SM tags
                                  in alignment files, provide a key here to
                                  map them. Accepts a path to a JSON file, or
                                  JSON-formatted string of key-value pairs
                                  where keys are sample names in the VCF and
                                  all values are either the SM tag or the
                                  filepath of the relevant alignment - e.g.
                                  '{"sample0": "PDxxA", "sample1": "PDxxB"}'
                                  or '{"sample0": "A.bam", ...}'. When only a
                                  single alignment is provided, also accepts a
                                  JSON-spec top-level array of possible sample
                                  of interest names - e.g.
                                  '["TUMOR","TUMOUR"]'. Note that when
                                  providing a JSON-formatted string at the
                                  command line you must single quote the
                                  string, and use only double quotes
                                  internally.
  -r, --cram-reference FILEPATH   path to FASTA format CRAM reference,
                                  overrides $REF_PATH and UR tags for CRAM
                                  alignments.
  -q, --quiet                     be quiet (-q to not log INFO level messages,
                                  -qq to additionally not log WARN).
  -p, --progress                  display progress bar on stderr during run.
  -h, --help                      Show this message and exit.

To expand on --name-mapping: when using multisample VCFs, hairpin2 compares the sample names found in the VCF header to the SM tags in the alignments to match each sample of interest to the correct alignment. If these IDs differ between the VCF and the alignments, you’ll need to provide a JSON mapping. Note that a VCF containing both a TUMOUR and a NORMAL sample contains 2 samples, and is therefore a multisample VCF.

If there are multiple samples of interest in a multisample VCF, and it is therefore necessary to provide multiple alignments, you will need to provide a mapping for each pair - e.g. -m '{"sample1":"SM1", "sample2":"SM2", ...}' or -m '{"sample1":"1.bam", ...}'.

If there is only one sample of interest, and therefore only one alignment is provided to the tool, an optional shorthand is available: you need only indicate which VCF sample is the sample of interest, e.g. -m '["TUMOR"]'. If that sample may have one of several possible names, you may instead provide a JSON array of the possible names, e.g. -m '["TUMOR", "TUMOUR"]' - users have found this valuable for high throughput workflows where the VCF input may come from one of several prior tools or callers (which may name samples differently).

In all cases, there must be one and only one match between each alignment and VCF sample, and a path to a JSON file may be provided instead of the JSON string.
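The accepted shapes of the mapping can be summarised in a short sketch. The sample names, SM tags, and file names below (including 2.bam and name-mapping.json) are placeholders rather than values the tool requires, and the hairpin2 invocation is left as a comment since it depends on your own inputs.

```shell
# Shape 1: object mapping VCF sample names to SM tags.
map_sm='{"sample1":"SM1","sample2":"SM2"}'

# Shape 2: object mapping VCF sample names to alignment file paths.
map_path='{"sample1":"1.bam","sample2":"2.bam"}'

# Shape 3 (single alignment only): array of possible names
# for the one sample of interest.
map_names='["TUMOR","TUMOUR"]'

# Any of these can also be stored in a file and passed by path.
printf '%s\n' "$map_sm" > name-mapping.json

# Example invocation with the file form (not run here):
# hairpin2 -c myconfig.toml -m name-mapping.json variants.vcf 1.bam 2.bam > output.vcf
```

Keeping the mapping in a file avoids shell quoting issues entirely, which may be convenient in workflow managers.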

hp2-utils

hairpin2 comes with a daughter tool, hp2-utils. hp2-utils provides some additional functionality to assist with the usage of hairpin2.

Usage: hp2-utils [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  explain-var  explain a hairpin2 flagging decision
  get-params   get run parameters from a VCF

explain-var

explain-var operates on any of the INFO fields set by hairpin2 in the output VCF, and reveals the specific reasoning behind the decision the tool made for a given flag. This is very useful when tuning hairpin2 to new data, or when investigating particular results. Usage is as follows:

hp2-utils explain-var "ADF=G|PASS|0x77|19|BOTH"  # an example INFO string

returning:

FLAG: ADF
ALT: G
variant outcome was PASS via conditions ['NO_TESTABLE_READS', 'INSUFFICIENT_READS', 'EDGE_CLUSTERING', 'BOTH_STRAND_DISTRIB_BOTH', 'BOTH_STRAND_DISTRIB_ONE', 'MIN_NON_EDGE'] on strand BOTH
reads examined: 19

More information can be found at Understanding Decisions.
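To see how the example INFO string lines up with the explained output, the sketch below splits it into its pipe-separated parts using plain shell. The variable names are illustrative, and the reading of each position (ALT, outcome, a hexadecimal code, read count, strand) is inferred from the example output above rather than from a format specification. Treating the hexadecimal field as a bit mask is likewise an assumption.

```shell
info="ADF=G|PASS|0x77|19|BOTH"   # the example INFO string from above

flag=${info%%=*}      # text before the first '=' (the flag name)
payload=${info#*=}    # text after the first '='

# Split the payload on '|' into its five parts.
IFS='|' read -r alt outcome code reads strand <<EOF
$payload
EOF

# Assumption: if the hex code is a bit mask, count how many bits are set.
n=$(( $code ))
bits=0
while [ "$n" -gt 0 ]; do
  bits=$(( bits + (n & 1) ))
  n=$(( n >> 1 ))
done

echo "flag=$flag alt=$alt outcome=$outcome reads=$reads strand=$strand bits_set=$bits"
```

For this string, six bits are set in 0x77, which matches the six conditions listed in the example output. The mapping from individual bits to condition names is not given here, so hp2-utils explain-var remains the reliable way to decode it.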

get-params

hairpin2 stores all run parameters in a reduced representation in the header of the output VCF. get-params translates the stored parameters back into JSON, which can then be used for other hairpin2 runs. Usage of get-params is as follows:

hp2-utils get-params my.hairpin2.vcf > parameters_used.json

More information can be found at Reproducibility & Parameter Distribution.
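As a rough sketch of the round trip, the recovered JSON can be supplied to a later run via --config, which accepts JSON as well as TOML. The input and output names for the new run (new-variants.vcf, new-aln.cram, new-output.vcf) are placeholders, and the guard simply skips the commands when the tool or the earlier VCF is not present.

```shell
# File names from the example above; substitute your own.
prev_vcf="my.hairpin2.vcf"
params_json="parameters_used.json"

if command -v hp2-utils >/dev/null 2>&1 && [ -f "$prev_vcf" ]; then
  # Recover the parameters recorded in the earlier run's VCF header.
  hp2-utils get-params "$prev_vcf" > "$params_json"
  # Reuse them as the config for a new run (placeholder inputs).
  hairpin2 -c "$params_json" new-variants.vcf new-aln.cram > new-output.vcf
  result="reran with $params_json"
else
  result="skipped: hp2-utils or $prev_vcf not available"
fi

echo "$result"
```

If sample names in the new VCF differ from the SM tags in the new alignments, a --name-mapping argument would still need to be supplied for that run.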