Data Requirements
What files do I need to provide as inputs for this analysis?
This page lists the information required to run the Pharmacogenetics Analysis Pipeline. The simplified diagram below gives an overview of every required input. Each input, including any required formatting, is described in more detail in its own section further down this page.
Overview
Data Requirements Diagram
---
title: Data Requirements
---
flowchart TB
subgraph Standard ["Standard Resources"]
genomeFasta[/"Reference Genome\nGRCh38 (FASTA)"/]
end
subgraph projectSpecific ["Project specific data"]
%% Use LR to invert axis set by parent to effectively force relative "TB"
direction LR
subgraph data ["Variant input data"]
datasetFiles[/"Datasets (VCF)"/]
end
subgraph metadata ["Analysis metadata"]
%% Use LR to invert axis set by parent to effectively force relative "TB"
direction LR
datasetMeta[/"Datasets metadata\n(CSV)"/]
locationMeta[/"Genomic location\nmetadata (CSV)"/]
sampleMeta[/"Sample metadata\n(CSV)"/]
transcriptMeta[/"Transcript metadata\n(CSV)"/]
end
end
Reference Genome
Reference sequences are standard resources rather than project-specific data, so they are not treated as pipeline inputs and are not covered in this section. For additional information, please consult the Reference Genome section on the pipeline configuration page.
Datasets & Dataset Files
Please provide all input datasets as Variant Call Format (`.vcf`) files. The latest version of the VCF specification can be found here.
Compression and Indexing
Genomic datasets are often very large in uncompressed form. You are welcome to compress your data files to improve performance and simplify file management.
If you wish to compress your VCF files, please provide the following files as input:
- BGZIP-compressed VCF file (`.vcf.gz` or `.vcf.bgz`)
- Tabix index (`.vcf.gz.tbi` or `.vcf.bgz.tbi`)
This pipeline is designed to accept `.vcf.gz` files produced by block compression (BGZIP). BGZIP output is gzip-compatible, but it is not what the default compression utilities on Windows or macOS produce. It compresses a `.vcf` file as a series of independent blocks and is supported by many popular bioinformatics tools.
On its own, block compression only makes your data file smaller. To use computational resources more efficiently, you can also create a Tabix index. This is a companion file to a BGZIP-compressed `.vcf.gz` file which records where each compression block falls relative to the genomic coordinates in the dataset, making targeted decompression of a region much more efficient.
Both block compression (`bgzip`) and Tabix indexing (`tabix`) are provided by HTSlib, which is maintained as part of the SAMtools project.
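As a quick sanity check before running the pipeline, the sketch below tests whether a file is block-compressed rather than plainly gzipped. It relies only on the BGZF header layout (a gzip member whose extra field carries a `BC` subfield of length 2). The function name `is_bgzip` is hypothetical and is not part of the pipeline.

```python
import struct


def is_bgzip(path):
    """Return True if the file starts with a BGZF (bgzip) block header.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    with open(path, "rb") as handle:
        header = handle.read(12)
        # gzip magic (1f 8b), deflate method (08) and the FEXTRA flag (0x04).
        if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
            return False
        # XLEN: little-endian length of the gzip extra field.
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)

    # Walk the extra subfields looking for the BGZF 'BC' subfield.
    offset = 0
    while offset + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[offset:offset + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        offset += 4 + slen
    return False
```

A file that fails this check can usually be fixed by decompressing it and recompressing it with `bgzip` before building the Tabix index.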
Metadata Declarations
To run the Pharmacogenetics Analysis Pipeline, you will need to provide some additional contextual information. All metadata is provided as appropriately named `.csv` files located in the input directory.
Case sensitivity
The following metadata declaration files use case-sensitive column names.
Datasets
The `datasets.csv` file allows you to declare datasets and provide the necessary dataset-level information for use in this pipeline.
Data requirements
- `dataset_name` `<str>`
  - The name of the dataset. This value is used as the universal identifier for the dataset and everything relating to it: output filenames are derived from it, and other metadata, such as sample-level information, is linked to the dataset through it.
    E.g. `1000G`
- `reference_genome` `<str>`
  - An enum indicating which reference genome version this dataset was called on.
    E.g. `GRCh37` or `GRCh38`
- `file` `<file_path>`
  - The path to the dataset file to be used in the analysis.
    E.g. `/nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz`
`datasets.csv` data example
dataset_name | reference_genome | file |
---|---|---|
HG002 | GRCh38 | /nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz |
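The column rules above can be checked before a run with a short script. The following is a minimal sketch, not the pipeline's own validation. It assumes a comma-delimited file, assumes `GRCh37` and `GRCh38` are the only accepted `reference_genome` values (taken from the examples above), and assumes `dataset_name` must be unique because it is used as an identifier. The function name `check_datasets` is hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("dataset_name", "reference_genome", "file")
# Assumption: only the two versions named in the examples are accepted.
REFERENCE_GENOMES = {"GRCh37", "GRCh38"}


def check_datasets(handle):
    """Return a list of problems found in a datasets.csv text stream."""
    reader = csv.DictReader(handle)
    # Column names are case-sensitive, so compare them exactly.
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems, seen = [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = row["dataset_name"]
        if name in seen:
            problems.append(f"line {line_no}: duplicate dataset_name {name!r}")
        seen.add(name)
        if row["reference_genome"] not in REFERENCE_GENOMES:
            problems.append(
                f"line {line_no}: unknown reference_genome {row['reference_genome']!r}"
            )
    return problems


example = io.StringIO(
    "dataset_name,reference_genome,file\n"
    "HG002,GRCh38,/nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz\n"
)
print(check_datasets(example))
```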
Samples
The `samples.csv` file allows you to declare samples and provide the necessary sample-level information for use in this pipeline.
Data requirements
- `sample_name` `<str>`
  - The ID of the sample. This should correspond to the sample IDs in the provided `.vcf` file.
    E.g. `HG002`
- `dataset` `<enum [dataset_name]>`
  - The name of the dataset this sample belongs to. This value should correspond to a `dataset_name` declared in `datasets.csv`.
    E.g. `1000G`
- `*` `<str>`
  - Any additional column containing a sample-level annotation, such as the `SUPER` and `SUB` columns in the example below.
    E.g. `EUR`
`samples.csv` data example
sample_name | dataset | SUPER | SUB |
---|---|---|---|
HG002 | HG002 | EUR | GBR |
HG002 | HG003 | AFR | GWD |
HG002 | HG004 | SAS | GIH |
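Because every `dataset` value in `samples.csv` must refer to a `dataset_name` declared in `datasets.csv`, a cross-check between the two files catches mismatches early. The sketch below assumes comma-delimited files and an exact, case-sensitive match on values. The function name `samples_with_unknown_dataset` is hypothetical.

```python
import csv
import io


def samples_with_unknown_dataset(samples_handle, datasets_handle):
    """Return (sample_name, dataset) pairs whose dataset is not declared."""
    declared = {row["dataset_name"] for row in csv.DictReader(datasets_handle)}
    return [
        (row["sample_name"], row["dataset"])
        for row in csv.DictReader(samples_handle)
        # Assumption: names must match exactly, including letter case.
        if row["dataset"] not in declared
    ]


datasets = io.StringIO(
    "dataset_name,reference_genome,file\n"
    "1000G,GRCh38,1000G.vcf.gz\n"
)
samples = io.StringIO(
    "sample_name,dataset,SUPER,SUB\n"
    "S1,1000G,EUR,GBR\n"
    "S2,1000g,AFR,GWD\n"
)
print(samples_with_unknown_dataset(samples, datasets))
```

In this illustrative data, `S2` is reported because `1000g` differs from the declared `1000G` in letter case.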
Genomic Locations
The `locations.csv` file allows you to declare genomic locations and provide the necessary location-level information for use in this pipeline.
Data requirements
- `location_name` `<str>`
  - The ID of a gene or, if the region is not a studied gene, a unique identifier for this genomic coordinate window.
    E.g. `CYP2A6`
- `chromosome` `<enum <int [0-24]> >`
  - The number of the chromosome on which the genomic region is found.
    E.g. `19`
- `start` `<int>`
  - The start coordinate of the genomic window.
    E.g. `40842850`
- `stop` `<int>`
  - The stop coordinate of the genomic window.
    E.g. `40851138`
- `strand` `<enum [-1,1]>`
  - The strand on which the genomic region is found, where `1` denotes the forward strand and `-1` denotes the reverse strand.
    E.g. `-1`
`locations.csv` data example
location_name | chromosome | start | stop | strand |
---|---|---|---|---|
CYP2A6 | 19 | 40842850 | 40851138 | -1 |
CYP2B6 | 19 | 40988570 | 41021110 | 1 |
UGT2B7 | 4 | 69045214 | 69112987 | 1 |
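The type constraints on each column can be checked row by row. This is a minimal sketch assuming a comma-delimited file. The rule that `start` must be smaller than `stop` is an assumption not stated above, and `check_location` is a hypothetical name.

```python
import csv
import io


def check_location(row):
    """Return a list of problems for one locations.csv row (a dict of strings)."""
    try:
        chromosome = int(row["chromosome"])
        start = int(row["start"])
        stop = int(row["stop"])
    except ValueError:
        return ["chromosome, start and stop must all be integers"]

    problems = []
    if not 0 <= chromosome <= 24:
        problems.append(f"chromosome {chromosome} is outside 0-24")
    # Assumption: a window must start before it stops.
    if start >= stop:
        problems.append(f"start {start} is not below stop {stop}")
    if row["strand"].strip() not in {"-1", "1"}:
        problems.append(f"strand {row['strand']!r} is not -1 or 1")
    return problems


example = io.StringIO(
    "location_name,chromosome,start,stop,strand\n"
    "CYP2A6,19,40842850,40851138,-1\n"
)
for row in csv.DictReader(example):
    print(row["location_name"], check_location(row) or "ok")
```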
Transcripts
The `transcripts.csv` file allows you to declare which transcripts you would like to use when performing variant-effect-prediction.
During the execution of the Pharmacogenetics Analysis Pipeline, variant-effect-prediction (VEP) is performed using a publicly accessible VEP query API provided by E! Ensembl. Currently, the API returns a separate VEP prediction for every transcript present at a given genomic location. You can provide a `transcripts.csv` input file to declare, per genomic region, the list of transcripts you would like considered in this analysis.
Transcript IDs
Please use transcripts listed in the E! Ensembl database.
Multiple Transcripts
If more than one transcript is provided for a given genomic region, the pipeline attempts to match them against the available transcripts in the order they are listed, from top to bottom. The first of your transcripts that matches a VEP result returned by E! Ensembl is selected. If none of your transcripts are available, the first available transcript result is selected.
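The selection rule described above can be sketched in a few lines. This is an illustration of the stated behaviour, not the pipeline's actual code: `select_transcript` is a hypothetical name, `available` stands in for the transcript IDs returned for a variant in the order received, and matching is assumed to be an exact string comparison that includes the version suffix.

```python
def select_transcript(preferred, available):
    """Pick a transcript ID following the documented priority rule.

    preferred: transcript IDs for a region, in transcripts.csv order.
    available: transcript IDs present in the VEP results, in returned order.
    """
    # First user-listed transcript that is actually available wins.
    for transcript_id in preferred:
        if transcript_id in available:
            return transcript_id
    # Otherwise fall back to the first available result, if there is one.
    return available[0] if available else None


preferred = ["NM_000762.6", "ENST00000600495.1"]
print(select_transcript(preferred, ["ENST00000596719.5", "ENST00000600495.1"]))
print(select_transcript(preferred, ["ENST00000596719.5"]))
```

The first call selects `ENST00000600495.1`, the highest-priority listed transcript that is available. The second falls back to `ENST00000596719.5` because none of the listed transcripts are present.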
Data requirements
- `gene_name` `<enum [str]>`
  - The name of the gene a transcript describes. This key should match the gene or region name provided in the `locations.csv` file.
    E.g. `CYP2A6`
- `transcript_id` `<str>`
  - The ID of the transcript in question. This value is used to query the E! Ensembl database when performing variant-effect-prediction.
    E.g. `NM_000762.6`
`transcripts.csv` data example
gene_name | transcript_id |
---|---|
CYP2A6 | NM_000762.6 |
CYP2A6 | ENST00000600495.1 |
CYP2A6 | ENST00000596719.5 |
CYP2A6 | ENST00000599960.1 |
CYP2B6 | NM_000767.5 |
CYP2B6 | ENST00000593831.1 |
CYP2B6 | ENST00000598834.2 |
CYP2B6 | ENST00000597612.1 |
CYP2B6 | ENST00000594187.1 |
UGT2B7 | NM_001074.4 |
UGT2B7 | ENST00000508661.5 |
UGT2B7 | ENST00000622664.1 |
UGT2B7 | ENST00000502942.5 |
UGT2B7 | ENST00000509763.1 |
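Since row order determines transcript priority, any code that reads this file needs to preserve it. The sketch below groups transcript IDs by gene while keeping the top-to-bottom order. It assumes a comma-delimited file, and `load_transcript_priorities` is a hypothetical name.

```python
import csv
import io


def load_transcript_priorities(handle):
    """Map each gene_name to its transcript IDs, in file order."""
    priorities = {}
    for row in csv.DictReader(handle):
        # Lists keep insertion order, so earlier rows retain higher priority.
        priorities.setdefault(row["gene_name"], []).append(row["transcript_id"])
    return priorities


example = io.StringIO(
    "gene_name,transcript_id\n"
    "CYP2A6,NM_000762.6\n"
    "CYP2A6,ENST00000600495.1\n"
    "CYP2B6,NM_000767.5\n"
)
print(load_transcript_priorities(example))
```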