Reference Genomes
How to declare a reference genome for use
You may configure the list of available reference genomes as an array of objects. Each object requires the following information:
- version <str> - The version string used to access this reference genome in the pipeline input files. E.g. "GRCh38"
- file_path <array [str]> - An array containing the decomposed location of the dataset to be used in the analysis. See the note below for additional information. E.g. ["/", "reference", "human", "GRCh38.fa.gz"]
We use Python's built-in os.path.join function to generate platform-specific paths from these arrays. Should you wish to provide a path from root, you may do so by setting the first element in the array to the drive reference for your OS. **Linux E.g. ["/", ...]**
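As a minimal illustration (assuming a POSIX system), a decomposed file_path array like the one above can be assembled into a concrete path with os.path.join:

```python
import os.path

# A decomposed "file_path" entry as it would appear in the config.
file_path = ["/", "reference", "human", "GRCh38.fa.gz"]

# os.path.join assembles the elements using the platform's separator.
full_path = os.path.join(*file_path)
print(full_path)  # → /reference/human/GRCh38.fa.gz on Linux
```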
Example "reference_genome" entry
```json
{
  "reference_genome": [
    {
      "version": "GRCh38",
      "file_path": ["/", "reference", "human", "GRCh38.fa.gz"]
    },
    {
      "version": "GRCh37",
      "file_path": ["/", "reference", "human", "GRCh37.fa.gz"]
    }
  ]
}
```
Performance Tips
Users are encouraged to use compression and indexing for a performance gain when working with reference genomes. Block compression (BGZIP), such as that provided by Samtools, can be used to compress a wide variety of bioinformatics file formats, including FASTA files. In order to decompress the blocks created through block compression, you will also need to create an appropriate index file describing the contents of each block. An example fileset for the GRCh38 reference genome would include:
- GRCh38.fa.gz (the block-compressed FASTA file)
- GRCh38.fa.gz.fai (the FASTA index, e.g. as produced by samtools faidx)
- GRCh38.fa.gz.gzi (the BGZF block index)
Should I include the index files in my config.json?
The accompanying index files need only be named and stored alongside the compressed file. They do not need to be listed in the reference_genome configuration entry.
Environment-related options
The VCF Validation Workflow supports several environment-related options, which are set in config/config.json as follows:
environment (Object)
This object contains all infrastructure-related configuration. These include:
email (Object)
If your PBS/Torque system's email notifications have been configured, you may configure a notification email as follows:
- email <str [Email]> - The email address to which notifications should be sent.
- conditions [ <enum ['a', 'b', 'e']> ] - An array of mail options indicating when you should receive a notification email for this pipeline execution. a indicates mail should be sent when the job is aborted, b indicates mail should be sent when the job begins, and e indicates mail should be sent when the job terminates.
Example 'email' entry
```json
{
  "email": {
    "email": "jane.doe@university.com",
    "conditions": ["a", "e"]
  }
}
```
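For intuition, here is a hedged sketch of how such an entry could translate into qsub's standard mail flags (-M for the address, -m for the conditions). The flags themselves are standard PBS/Torque; the mapping is illustrative, not the workflow's actual code:

```python
# Hypothetical translation of an "email" entry into qsub mail flags.
# The conditions follow the a/b/e mail options described above.
email = {
    "email": "jane.doe@university.com",
    "conditions": ["a", "e"],  # notify on abort and on termination
}
mail_flags = f"-M {email['email']} -m {''.join(email['conditions'])}"
print(mail_flags)  # → -M jane.doe@university.com -m ae
```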
queues
The PBS/Torque batch scheduling system manages per-installation generic resources like memory, time, and CPUs, which are available for request by users. The VCF Validation Workflow has been designed to take advantage of the granularity provided by these scheduler systems: each rule in the workflow can be split into a separate job submission on a cluster. As a result, it is possible to parallelize the analysis and assign cluster resources on a per-rule basis.
To do this, you may use the queues
key to describe the available PBS/Torque resources and queues you would like to use. These can be described as follows:
Custom core and node selections
In some cases, users might want to run some jobs on multiple nodes and others on a single node. To support this, you may declare the same underlying queue multiple times, each with a different
queue
key in the config file, creating multiple versions of the same underlying hardware queue.
It is recommended that you submit the workflow execution script with the longest available walltime, as this will create a watcher process that is responsible for queueing each rule and monitoring its state. If this process is interrupted, the workflow will cease.
- queue <str> - The name of the queue.
- walltime <str> - The maximum walltime jobs on this queue are permitted to run, in HH:MM:SS format. E.g. "900:00:00" = 37.5 days
- memory <str> - The amount of RAM available on this queue. E.g. "128G"
- cores <str> - The number of cores available on this queue. E.g. "10"
- nodes <str> - The number of nodes available in this queue. E.g. "1"
- rules <array [<str>]> - An array of rules this queue should be used for. For a reference of available rules, see the rules list included in the example below.
Example 'queues' entry
```json
{
  "queues": [
    {
      "queue": "long",
      "walltime": "900:00:00",
      "memory": "128G",
      "cores": "10",
      "nodes": "1",
      "rules": [
        "all",
        "VALIDATE",
        "LIFTOVER",
        "COLLATE",
        "ALL_COLLATE",
        "ANNOTATE",
        "ADMIXTURE",
        "TRIM_AND_NAME",
        "FILTER",
        "TRANSPILE_CLUSTERS",
        "PLINK"
      ]
    }
  ]
}
```
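To make the schema concrete, here is a hedged sketch of how one queues entry could be rendered into a qsub resource request. The flag layout (-q, -l nodes=...:ppn=...) is standard PBS/Torque syntax, but the workflow's actual submission command may differ:

```python
# Illustrative only: build a qsub resource string from a "queues" entry.
def qsub_resources(entry):
    return (
        f"-q {entry['queue']} "
        f"-l nodes={entry['nodes']}:ppn={entry['cores']},"
        f"mem={entry['memory']},walltime={entry['walltime']}"
    )

entry = {
    "queue": "long",
    "walltime": "900:00:00",
    "memory": "128G",
    "cores": "10",
    "nodes": "1",
}
print(qsub_resources(entry))  # → -q long -l nodes=1:ppn=10,mem=128G,walltime=900:00:00
```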
envmodules
The envmodules
key allows users to provide Environment Modules accessor names. These are used internally by Snakemake to execute the required module load
commands before queued rule execution. module load
name accessors will be needed for the following command-line tools:
- plink-2
- plink-1.9
- bcftools
- samtools
- piccard
- structure
- admixture-1.3
- python-3
- r
- latex
Example 'envmodules' entry
```json
{
  "envmodules": {
    "plink-2": "plink-2",
    "plink-1.9": "plink-1.9",
    "bcftools": "bcftools",
    "samtools": "samtools",
    "piccard": "piccard",
    "structure": "structure",
    "admixture-1.3": "admixture-1.3",
    "python-3": "python-3",
    "r": "r",
    "latex": "latex"
  }
}
```
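For illustration, the module load commands issued before rule execution can be derived directly from this mapping. The accessor names below are assumptions; use whatever names module avail reports on your cluster:

```python
# Illustrative only: derive the "module load" commands from an
# envmodules mapping. Module accessor names are site-specific.
envmodules = {
    "bcftools": "bcftools",
    "samtools": "samtools",
}
commands = [f"module load {accessor}" for accessor in envmodules.values()]
print(commands)  # → ['module load bcftools', 'module load samtools']
```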