Data Requirements

What files do I need to provide as inputs for this analysis?

Table of contents
  1. Overview
  2. Reference Genome
  3. Datasets & Dataset Files
    1. Compression and Indexing
  4. Metadata Declarations
    1. Datasets
      1. Data requirements
    2. Samples
      1. Data requirements
    3. Genomic Locations
      1. Data requirements
    4. Transcripts
      1. Data requirements

This page lists the information required to run the Pharmacogenetics Analysis Pipeline. The simplified overview diagram below shows the full list of required inputs. Each requirement, and any formatting it involves, is explained in more detail in the relevant section further down.

Overview

Data Requirements Diagram
---
title: Data Requirements
---
flowchart TB
    subgraph Standard ["Standard Resources"]
        genomeFasta[/"Reference Genome\nGRCh38 (FASTA)"/]
    end
    subgraph projectSpecific ["Project specific data"]
        %% Use LR to invert axis set by parent to effectively force relative "TB"
        direction LR
        subgraph data ["Variant input data"]
            datasetFiles[/"Datasets (VCF)"/]
        end
        subgraph metadata ["Analysis metadata"]
            %% Use LR to invert axis set by parent to effectively force relative "TB"
            direction LR

            datasetMeta[/"Datasets metadata\n(CSV)"/]
            locationMeta[/"Genomic location\nmetadata (CSV)"/]
            sampleMeta[/"Sample metadata\n(CSV)"/]
            transcriptMeta[/"Transcript metadata\n(CSV)"/]
        end
    end

Reference Genome

Reference sequences are not provided as pipeline inputs and are therefore not covered in this section. For more information, see the Reference Genome section on the pipeline configuration page.

Datasets & Dataset Files

Please provide all input datasets as Variant Call Format (.vcf) files. See the latest version of the VCF specification for details of the format.

Compression and Indexing

Genomic datasets are often very large when uncompressed. You are welcome to compress your data files to improve performance and make them easier to manage.

If you wish to compress your VCF files, please provide the following files as input:

  • BGZIP-compressed VCF file (.vcf.gz or .vcf.bgz)
  • Tabix Index (.vcf.gz.tbi or .vcf.bgz.tbi)

This pipeline accepts .vcf.gz files produced by block compression (BGZIP). BGZIP is not the default compression tool on Windows or macOS. It compresses a .vcf file as a series of independent blocks, and its output can still be read by standard gzip tools. Many popular bioinformatics tools can produce it.

On its own, block compression only makes your data file smaller. To make more efficient use of computational resources, you can also create a Tabix index. This accompanying index file records where each compression block falls relative to the genomic coordinates in the dataset, so that only the blocks covering a requested region need to be decompressed.

Both block compression (bgzip) and Tabix indexing (tabix) are provided by HTSlib, which is developed alongside SAMtools.
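
A common mistake is to compress a VCF with ordinary gzip, which produces a .vcf.gz file that cannot be Tabix-indexed. The sketch below is one way to check this before running the pipeline; it is not part of the pipeline itself. It relies on the BGZF block layout (a gzip header whose extra field carries a `BC` subfield), and the function names `is_bgzf` and `looks_block_compressed` are hypothetical helpers introduced for illustration. Files are typically prepared with `bgzip data.vcf` followed by `tabix -p vcf data.vcf.gz`.

```python
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the leading bytes look like a BGZF (bgzip) block header."""
    # A BGZF block is a gzip member with the FEXTRA flag (bit 2 of FLG) set.
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
        return False
    # XLEN (little-endian, offset 10) gives the length of the extra field.
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    # Walk the extra subfields looking for the 'BC' subfield written by bgzip.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


def looks_block_compressed(path: str) -> bool:
    """Read the start of `path` and report whether it begins with a BGZF block."""
    with open(path, "rb") as handle:
        # 1 KiB is assumed to be enough to cover the header and its extra field.
        return is_bgzf(handle.read(1024))
```

A file that passes this check should also have its index next to it, for example `data.vcf.gz` alongside `data.vcf.gz.tbi`.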

Metadata Declarations

To run the Pharmacogenetics Analysis Pipeline, you will need to provide some additional contextual information. All metadata is provided as appropriately named .csv files located in the input directory.

Case sensitivity

The following metadata declaration files use case-sensitive column names.


Datasets

The datasets.csv file allows you to declare datasets and provide the necessary dataset-level information for use in this pipeline.

Data requirements

dataset_name <str>
The name of the dataset. This value is used as a universal accessor for the dataset and any information relating to it: output files use it to determine things like filenames, and it is used to link other metadata, such as sample-level information, to this dataset.
E.g. 1000G
reference_genome <str>
An enum indicating which reference genome version this dataset has been called on.
E.g. GRCh37 or GRCh38
file <file_path>
A file path indicating the location of the dataset to be used in the analysis.
E.g. /nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz
datasets.csv data example
dataset_name reference_genome file
HG002 GRCh38 /nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz
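
The column rules above can be checked before a run. The following is a minimal sketch, not the pipeline's own validation: it assumes a comma-delimited file with a header row, assumes that GRCh37 and GRCh38 are the only accepted reference_genome values, and assumes that dataset_name must be unique because it is used as an accessor. The function name `check_datasets` and the paths in the usage note are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = {"dataset_name", "reference_genome", "file"}
KNOWN_GENOMES = {"GRCh37", "GRCh38"}  # assumption: the only accepted values


def check_datasets(handle):
    """Return a list of human-readable problems found in a datasets.csv handle."""
    reader = csv.DictReader(handle)
    # Column names are case-sensitive, so compare them exactly.
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing column(s): {', '.join(sorted(missing))}"]
    problems, seen = [], set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        name = row["dataset_name"]
        if name in seen:
            problems.append(f"line {line_no}: duplicate dataset_name {name!r}")
        seen.add(name)
        genome = row["reference_genome"]
        if genome not in KNOWN_GENOMES:
            problems.append(f"line {line_no}: unknown reference_genome {genome!r}")
    return problems
```

For example, `check_datasets(io.StringIO("dataset_name,reference_genome,file\nHG002,GRCh38,/data/HG002.vcf.gz\n"))` returns an empty list, while a header spelled `Dataset_Name` is reported as a missing column.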

Samples

The samples.csv file allows you to declare samples and provide the necessary sample-level information for use in this pipeline.

Data requirements

sample_name <str>
The ID of the sample. This should correspond to a sample ID in the provided .vcf file.
E.g. HG002
dataset <enum [dataset_name]>
The name of the dataset this sample belongs to. This value should correspond to a dataset_name listed in datasets.csv.
E.g. 1000G
* <str>
Any additional columns, each holding a sample-level annotation, such as the SUPER and SUB population columns in the example below.
E.g. EUR
samples.csv data example
sample_name dataset SUPER SUB
HG002 HG002 EUR GBR
HG002 HG003 AFR GWD
HG002 HG004 SAS GIH
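
Because the dataset column must match a dataset_name exactly, a mismatch in letter case (for example 1000g instead of 1000G) is easy to miss. The sketch below shows one way to list samples that reference an undeclared dataset. It assumes comma-delimited files with header rows; `unknown_dataset_references` is a hypothetical helper, not a pipeline function.

```python
import csv
import io


def unknown_dataset_references(samples_handle, datasets_handle):
    """Return (sample_name, dataset) pairs whose dataset is not declared in datasets.csv."""
    # Collect declared names exactly as written; the comparison is case-sensitive.
    declared = {row["dataset_name"] for row in csv.DictReader(datasets_handle)}
    return [
        (row["sample_name"], row["dataset"])
        for row in csv.DictReader(samples_handle)
        if row["dataset"] not in declared
    ]
```

An empty result means every sample points at a declared dataset.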

Genomic Locations

The locations.csv file allows you to declare genomic locations and provide the necessary location-level information for use in this pipeline.

Data requirements

location_name <str>
The ID of a gene or, if not a studied gene region, a unique identifier to reference this genomic coordinate window.
E.g. CYP2A6
chromosome <enum <int [0-24]> >
The chromosome number on which the above genomic region can be found.
E.g. 19
start <int>
The start coordinate of the genomic window.
E.g. 40842850
stop <int>
The stop coordinate of the genomic window.
E.g. 40851138
strand <enum [-1,1]>
The strand on which the genomic region can be found, where 1 denotes the forward strand and -1 denotes the reverse strand.
E.g. -1
locations.csv data example
location_name chromosome start stop strand
CYP2A6 19 40842850 40851138 -1
CYP2B6 19 40988570 41021110 1
UGT2B7 4 69045214 69112987 1
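
Each row describes a coordinate window, which can be sanity-checked and turned into the `chromosome:start-stop` region notation accepted by tools such as tabix. This is a sketch under stated assumptions: rows come from `csv.DictReader` on a comma-delimited file, the chromosome value is used as written (whether your VCF expects a `chr` prefix is not covered here), and `location_to_region` is a hypothetical helper.

```python
import csv
import io


def location_to_region(row):
    """Validate one locations.csv row and return a 'chromosome:start-stop' string."""
    start, stop, strand = int(row["start"]), int(row["stop"]), int(row["strand"])
    if start > stop:
        raise ValueError(f"start {start} is greater than stop {stop}")
    if strand not in (-1, 1):
        raise ValueError(f"strand must be -1 or 1, got {strand}")
    return f"{row['chromosome']}:{start}-{stop}"


example = io.StringIO(
    "location_name,chromosome,start,stop,strand\n"
    "CYP2A6,19,40842850,40851138,-1\n"
)
# Map each location name to its region string.
regions = {row["location_name"]: location_to_region(row) for row in csv.DictReader(example)}
```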

Transcripts

The transcripts.csv file allows you to declare which transcripts you would like to use when performing variant-effect-prediction.

When the Pharmacogenetics Analysis Pipeline runs, variant-effect-prediction (VEP) is performed using a publicly accessible VEP query API provided by E! Ensembl. Currently, the API returns a separate prediction for every transcript present at a given genomic location. You can provide a transcripts.csv input file to declare, per genomic region, the transcripts you would like to consider for this analysis.

Transcript IDs

Please use transcripts listed on the E! Ensembl Database.

Multiple Transcripts

If more than one transcript is provided for a given genomic region, the pipeline attempts to match them against the available transcripts in the order provided, from top to bottom. The first transcript in the user's selection that is also returned by E! Ensembl is selected. If none of the provided transcripts are available, the first available transcript result is selected.
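
The selection rule described above can be sketched as follows. This is an illustration of the stated behaviour, not the pipeline's actual code: `select_transcript` is a hypothetical name, and exact string matching of transcript IDs (including the version suffix) is an assumption.

```python
def select_transcript(preferred, available):
    """Pick the first preferred transcript present in `available`.

    `preferred` is the user's list for one region, in file order.
    `available` is the list of transcript IDs returned for a variant.
    Falls back to the first available transcript, or None if there are none.
    """
    available_set = set(available)
    for transcript_id in preferred:
        if transcript_id in available_set:
            return transcript_id
    return available[0] if available else None
```

With a preference list of `["NM_000762.6", "ENST00000600495.1"]` and available transcripts `["ENST00000600495.1", "ENST00000596719.5"]`, the second preference is chosen because the first is not available.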

Data requirements

gene_name <enum [str]>
The name of the gene a transcript describes. This key should match the gene or region name provided in the locations.csv file.
E.g. CYP2A6
transcript_id <str>
The name of the transcript in question. This value will be used to query the E! Ensembl database when performing variant-effect-prediction.
E.g. NM_000762.6
transcripts.csv data example
gene_name transcript_id
CYP2A6 NM_000762.6
CYP2A6 ENST00000600495.1
CYP2A6 ENST00000596719.5
CYP2A6 ENST00000599960.1
CYP2B6 NM_000767.5
CYP2B6 ENST00000593831.1
CYP2B6 ENST00000598834.2
CYP2B6 ENST00000597612.1
CYP2B6 ENST00000594187.1
UGT2B7 NM_001074.4
UGT2B7 ENST00000508661.5
UGT2B7 ENST00000622664.1
UGT2B7 ENST00000502942.5
UGT2B7 ENST00000509763.1
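
Since row order determines transcript priority, it can help to view the file as one ordered list per gene. The sketch below assumes a comma-delimited file with a header row; `transcripts_by_gene` is a hypothetical helper rather than part of the pipeline.

```python
import csv
import io


def transcripts_by_gene(handle):
    """Group transcript IDs by gene_name, preserving top-to-bottom file order."""
    grouped = {}
    for row in csv.DictReader(handle):
        grouped.setdefault(row["gene_name"], []).append(row["transcript_id"])
    return grouped


example = io.StringIO(
    "gene_name,transcript_id\n"
    "CYP2A6,NM_000762.6\n"
    "CYP2B6,NM_000767.5\n"
    "CYP2A6,ENST00000600495.1\n"
)
priorities = transcripts_by_gene(example)
```

Here the list for CYP2A6 keeps NM_000762.6 ahead of ENST00000600495.1, even though a CYP2B6 row sits between them in the file.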

This work is licensed under a Creative Commons Attribution 4.0 International License. This project is managed by the Institute for Cellular and Molecular Medicine at the University of Pretoria.