Reference Genomes
How to declare a reference genome for use
You may configure the list of available reference genomes as an array of objects. Each object requires the following information:
- version <str> - The version string used to access this reference genome in the pipeline input files. E.g. "GRCh38"
- file_path <array [str]> - An array containing the decomposed location of the dataset to be used in the analysis. See the note below for additional information. E.g. ["/", "reference", "human", "GRCh38.fa.gz"]
We use Python's built-in os.path.join function to generate platform-specific paths from these arrays. Should you wish to provide a path from root, you may do so by setting the first element in the array to the drive reference for your OS. **Linux E.g. ["/", ...]**
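As a minimal illustration (assuming a POSIX system), a decomposed file_path array like the one above can be assembled into a concrete path with os.path.join:

```python
import os.path

# A decomposed "file_path" entry as it would appear in the config.
file_path = ["/", "reference", "human", "GRCh38.fa.gz"]

# os.path.join assembles the elements using the platform's separator.
full_path = os.path.join(*file_path)
print(full_path)  # → /reference/human/GRCh38.fa.gz on Linux
```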
Example "reference_genome" entry
```json
{
  "reference_genome": [
    {
      "version": "GRCh38",
      "file_path": ["/", "reference", "human", "GRCh38.fa.gz"]
    },
    {
      "version": "GRCh37",
      "file_path": ["/", "reference", "human", "GRCh37.fa.gz"]
    }
  ]
}
```
Performance Tips
Users are encouraged to use compression and indexing for a performance gain when working with reference genomes. Block compression (BGZIP), such as that provided by Samtools, can be used to compress a wide variety of bioinformatics file formats, including FASTA files. In order to decompress the blocks created through block compression, you will also need to create an appropriate index file describing the contents of each block. An example fileset for the GRCh38 reference genome would include:
- GRCh38.fa.gz (the block-compressed FASTA file)
- GRCh38.fa.gz.fai (the FASTA index, e.g. as produced by samtools faidx)
- GRCh38.fa.gz.gzi (the BGZF block index)
Should I include the index files in my config.json?
The accompanying index files need only be named and stored alongside the compressed file. They do not need to be listed in the reference_genome configuration entry.
Environment-related options
The VCF Validation Workflow supports several environment-related options, which are set in config/config.json as follows:
environment (Object)
This object contains all infrastructure-related configuration. These include:
email (Object)
If your PBS/Torque system's email notifications have been configured, you may configure a notification email as follows:
- email <str [Email]> - The email address to which notifications should be sent.
- conditions [ <enum ['a', 'b', 'e']> ] - An array of mail options indicating when you should receive a notification email for this pipeline execution. a indicates mail should be sent when the job is aborted, b indicates mail should be sent when the job begins, and e indicates mail should be sent when the job terminates.
Example 'email' entry
```json
{
  "email": {
    "email": "jane.doe@university.com",
    "conditions": ["a", "e"]
  }
}
```
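For intuition, here is a hedged sketch of how such an entry could translate into qsub's standard mail flags (-M for the address, -m for the conditions). The flags themselves are standard PBS/Torque; the mapping is illustrative, not the workflow's actual code:

```python
# Hypothetical translation of an "email" entry into qsub mail flags.
# The conditions follow the a/b/e mail options described above.
email = {
    "email": "jane.doe@university.com",
    "conditions": ["a", "e"],  # notify on abort and on termination
}
mail_flags = f"-M {email['email']} -m {''.join(email['conditions'])}"
print(mail_flags)  # → -M jane.doe@university.com -m ae
```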
queues
The PBS/Torque batch scheduling system manages per-installation generic resources like memory, time, and CPUs, which are available for request by users. The VCF Validation Workflow has been designed to take advantage of the granularity provided by these scheduler systems: each rule in the workflow can be split into a separate job submission on a cluster. As a result, it is possible to parallelize the analysis and assign cluster resources on a per-rule basis.
To do this, you may use the queues
key to describe the available PBS/Torque resources and queues you would like to use. These can be described as follows:
Custom core and node selections
In some cases, users might want to run some jobs on multiple nodes and others on a single node. To support this, you may declare the same underlying queue multiple times, each with a different
queue
key in the config file, creating multiple versions of the same underlying hardware queue.
It is recommended that you submit the workflow execution script with the longest available walltime, as this will create a watcher process that is responsible for queueing each rule and monitoring its state. If this process is interrupted, the workflow will cease.
- queue <str> - The name of the queue.
- walltime <str> - The maximum walltime jobs on this queue are permitted to run, in HH:MM:SS format. E.g. "900:00:00" = 37.5 days
- memory <str> - The amount of RAM available on this queue. E.g. "128G"
- cores <str> - The number of cores available on this queue. E.g. "10"
- nodes <str> - The number of nodes available in this queue. E.g. "1"
- rules <array [<str>]> - An array of rules this queue should be used for. For a reference of available rules, see the rules list included in the example below.
Example 'queues' entry
```json
{
  "queues": [
    {
      "queue": "long",
      "walltime": "900:00:00",
      "memory": "128G",
      "cores": "10",
      "nodes": "1",
      "rules": [
        "all",
        "VALIDATE",
        "LIFTOVER",
        "COLLATE",
        "ALL_COLLATE",
        "ANNOTATE",
        "ADMIXTURE",
        "TRIM_AND_NAME",
        "FILTER",
        "TRANSPILE_CLUSTERS",
        "PLINK"
      ]
    }
  ]
}
```
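To make the schema concrete, here is a hedged sketch of how one queues entry could be rendered into a qsub resource request. The flag layout (-q, -l nodes=...:ppn=...) is standard PBS/Torque syntax, but the workflow's actual submission command may differ:

```python
# Illustrative only: build a qsub resource string from a "queues" entry.
def qsub_resources(entry):
    return (
        f"-q {entry['queue']} "
        f"-l nodes={entry['nodes']}:ppn={entry['cores']},"
        f"mem={entry['memory']},walltime={entry['walltime']}"
    )

entry = {
    "queue": "long",
    "walltime": "900:00:00",
    "memory": "128G",
    "cores": "10",
    "nodes": "1",
}
print(qsub_resources(entry))  # → -q long -l nodes=1:ppn=10,mem=128G,walltime=900:00:00
```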
envmodules
The envmodules
key allows users to provide Environment Modules accessor names. These are used internally by Snakemake to execute the required module load
commands before queued rule execution. module load
name accessors will be needed for the following command-line tools:
- plink-2
- plink-1.9
- bcftools
- samtools
- piccard
- structure
- admixture-1.3
- python-3
- r
- latex
Example 'envmodules' entry
```json
{
  "envmodules": {
    "plink-2": "plink-2",
    "plink-1.9": "plink-1.9",
    "bcftools": "bcftools",
    "samtools": "samtools",
    "piccard": "piccard",
    "structure": "structure",
    "admixture-1.3": "admixture-1.3",
    "python-3": "python-3",
    "r": "r",
    "latex": "latex"
  }
}
```
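For illustration, the module load commands issued before rule execution can be derived directly from this mapping. The accessor names below are assumptions; use whatever names module avail reports on your cluster:

```python
# Illustrative only: derive the "module load" commands from an
# envmodules mapping. Module accessor names are site-specific.
envmodules = {
    "bcftools": "bcftools",
    "samtools": "samtools",
}
commands = [f"module load {accessor}" for accessor in envmodules.values()]
print(commands)  # → ['module load bcftools', 'module load samtools']
```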