Configuration & Data

A summary of the required data and input files needed to perform an analysis.

Table of contents
  1. Overview
  2. Input Data
    1. Compression and Indexing
  3. Analysis configuration
    1. Metadata
      1. datasets.csv Metadata
      2. samples.csv Metadata
      3. locations.csv Metadata
      4. transcripts.csv Metadata

This page describes the information needed to run the Pharmacogenetics Analysis Workflow. Below we guide users through the system used to declare an analysis manifest and all associated metadata files. For more information, please consult the relevant section below, which contains more specific guidance, discussion, and technical documentation.

Overview

This workflow makes use of an analysis manifest to encapsulate all analysis variables used. This manifest file collects and connects the metadata for your samples, datasets, and relevant reference resources (Reference Genomes, etc) together. Doing so allows the workflow to programmatically access clusters through sample annotations, which is required in order to produce cluster-level reports.

Input Data Infographic
---
title: Input filemap
config:
    flowchart:
        defaultRenderer: elk
    elk:
        nodePlacementStrategy: BRANDES_KOEPF
---
flowchart TB
  subgraph input [input files]
      subgraph data [Datasets]
          datasetFile1{{<b>Dataset file</b><br><code>GnomAD_Chr1.vcf.gz</code>}}
          datasetFile2{{<b>Dataset file</b><br><code>GnomAD_Chr2.vcf.gz</code>}}
          datasetFileN{{<b>Dataset file</b><br><code>GnomAD_ChrN...vcf.gz</code>}}
      end

      subgraph metadata [Analysis Metadata]
          locationMeta{{<b>Coordinates for study</b><br><code>locations.csv</code>}}
          sampleMeta{{<b>Sample metadata</b><br><code>samples.csv</code>}}
          datasetMeta{{<b>Data files to include</b><br><code>datasets.csv</code>}}
          transcriptMeta{{<b>Transcript preferences</b><br><code>transcripts.csv</code>}}
      end
  end
  subgraph resources [<code>resources/</code>]
      reference_genome{{Reference Genome <br> <code>resources/genome_version_name.fa</code>}}
  end
  subgraph config [<code>config/</code>]
    configuration{{<b>Analysis configuration</b> <br><code>config/configuration.json</code>}}
  end

  vcf_validation_workflow[\VCF Validation Workflow/]
  click vcf_validation_workflow href "https://tuks-icmm.github.io/VCF-Validation-Workflow/workflow/methodology" _blank

  pharmacogenetic_analysis_workflow[\Pharmacogenetics Analysis Workflow/]
  click pharmacogenetic_analysis_workflow href "/workflow/methodology" _blank

  population_structure_workflow[\Population structure Workflow/]
  click population_structure_workflow href "https://tuks-icmm.github.io/Population-Structure-Workflow/workflow/methodology" _blank

  datasetMeta -...-o|Referenced in| reference_genome

  metadata -.-o|Describes| data

  input --> vcf_validation_workflow
  config ----> vcf_validation_workflow
  resources ----> vcf_validation_workflow

  vcf_validation_workflow --> pharmacogenetic_analysis_workflow

  pharmacogenetic_analysis_workflow --> population_structure_workflow
  pharmacogenetic_analysis_workflow --> results

  population_structure_workflow --> results

  results(((Results)))

Input Data

This workflow is designed to work on Variant Call Format files (.vcf file extension). The latest version of the VCF specification can be found here.

Compression and Indexing

This workflow can accept uncompressed VCF files; however, it will compress and index the data during handling for performance reasons. If possible, please provide your files in compressed and indexed form.
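If you would like to prepare compressed, indexed files ahead of time, the sketch below shows one possible approach using the htslib command-line tools (bgzip and tabix) driven from Python. The file name is only an example, and it is assumed that bgzip and tabix are installed and on the PATH; this is not a command the workflow itself runs.

import subprocess
from pathlib import Path

def compress_and_index(vcf_path: str) -> Path:
    # bgzip performs block compression (required for indexing);
    # --force overwrites any existing .gz file.
    subprocess.run(["bgzip", "--force", vcf_path], check=True)
    compressed = Path(f"{vcf_path}.gz")
    # tabix builds a coordinate index using the VCF preset.
    subprocess.run(["tabix", "-p", "vcf", str(compressed)], check=True)
    return compressed

# Hypothetical per-contig dataset file:
compress_and_index("GnomAD_Chr1.vcf")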

This workflow uses an alternative file format for internal handling

Since a large portion of this workflow makes use of Plink-2, the workflow is configured to convert input files into Plink-2's binary fileset (.pgen, .pvar and .psam files), which offers very good processing performance and improved tracking. We also perform an additional reference-guided allele verification to remove swapped alleles from data tracked using Plink-1.9 binary files.
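For reference, an equivalent conversion to the Plink-2 binary fileset can be sketched as follows. This is purely illustrative and assumes a plink2 executable is available; the workflow performs this conversion (and the allele verification) internally with its own settings.

import subprocess

def vcf_to_pgen(vcf_path: str, out_prefix: str) -> None:
    # Convert a (compressed or uncompressed) VCF into .pgen/.pvar/.psam files.
    subprocess.run(
        [
            "plink2",
            "--vcf", vcf_path,
            "--make-pgen",
            "--out", out_prefix,
        ],
        check=True,
    )

# Hypothetical example:
vcf_to_pgen("GnomAD_Chr1.vcf.gz", "GnomAD_Chr1")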

Analysis configuration

To perform an analysis with this workflow, users will need to configure the workflow. This includes providing environment-related information like output locations, as well as analysis settings like reference population selection. This information is all declared and stored using the config/manifest.json file.

The manifest.json file

This file is responsible for declaring all information relating to the analysis and serves as the central point of contact between the workflow runtime and your input data. It is also used to configure and synchronize any sub-workflows imported internally.

manifest.json format example
input <object>
    datasets <Array<str>>
        A list representing the file path to the dataset metadata file. Should be suitable for use with the Python os.path.join() function.
    locations <Array<str>>
        A list representing the file path to the location metadata file. Should be suitable for use with the Python os.path.join() function.
    samples <Array<str>>
        A list representing the file path to the sample metadata file. Should be suitable for use with the Python os.path.join() function.
    transcripts <Array<str>>
        A list representing the file path to the transcript metadata file. Should be suitable for use with the Python os.path.join() function.
output <Array<str>>
    A list representing the path to a folder where the results of the analysis should be stored. If the folder does not exist, it will be created.
resources <Object>
    reference_genomes <Array<Object>>
        This property should contain a list of objects, where each object describes a reference genome available for use, using the following properties:
        name <str>
            The name of the reference genome. Should correspond to the value used in the dataset metadata file.
        location <Array<str>>
            A list representing the file path to the reference genome. Should be provided in FASTA format.
parameters <Object>
    fishers-test <object>
        cluster_name* <str>
            The name of the cluster level declared in your sample metadata file for which you would like to declare a reference population. This population will be used to conduct pair-wise testing against all remaining populations in that column.
  {
    "input": {
        "datasets": [
            "/",
            "path",
            "to",
            "my",
            "dataset",
            "metadata"
        ],
        "locations": [
            "/",
            "path",
            "to",
            "my",
            "locations",
            "metadata"
        ],
        "samples": [
            "/",
            "path",
            "to",
            "my",
            "samples",
            "metadata"
        ],
        "transcripts": [
            "/",
            "path",
            "to",
            "my",
            "transcripts",
            "metadata"
        ]
    },
    "output": [
        "/",
        "path",
        "to",
        "my",
        "output",
        "location"
    ],
    "resources": {
        "reference_genomes": [
            {
                "name": "",
                "location": [
                    "/",
                    "path",
                    "to",
                    "my",
                    "reference",
                    "genome"
                ]
            }
        ]
    },
    "parameters": {
        "fishers-test": {
            "my_cluster": "my_population_of_interest"
        }
    }
}
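To illustrate how the path arrays above are consumed, here is a minimal sketch (not part of the workflow itself) that loads a manifest and resolves its entries with the Python os.path.join() function:

import json
import os

# Load the manifest from its expected discovery path.
with open(os.path.join("config", "manifest.json")) as handle:
    manifest = json.load(handle)

# Each path is declared as a list of segments, so it can be resolved
# portably, regardless of operating system.
datasets_csv = os.path.join(*manifest["input"]["datasets"])
output_dir = os.path.join(*manifest["output"])

# The output folder is created if it does not already exist.
os.makedirs(output_dir, exist_ok=True)
print(datasets_csv, output_dir)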

Metadata

All data and sample metadata is provided in the form of .csv files declared in the manifest.json file. These files allow you to declare datasets and provide the information necessary to determine which contig-level files should be used for analysis, given the provided genomic coordinates. For convenience, we will assume the standard file names used below (datasets.csv, samples.csv, locations.csv and transcripts.csv) for the sake of this explanation.

This design-pattern of declaring metadata files via the manifest.json was chosen specifically to allow users to create and store analysis configurations and metadata alongside data, which often has special storage requirements (e.g. space, access, etc). Through the manifest.json file, all other analysis-specific files will be declared and will be accessible. This then only requires that the manifest.json file is discoverable under the path config/manifest.json, which can be accomplished with a symlink or shortcut, keeping the amount of setup work to a minimum.
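As an example of the symlink approach, the following sketch (with a hypothetical storage path) exposes a manifest stored alongside the data at the expected config/manifest.json location:

from pathlib import Path

# Hypothetical location of a manifest stored alongside the data.
real_manifest = Path("/mnt/project-storage/my-analysis/manifest.json")

# Expose it at the path the workflow expects to discover.
link = Path("config/manifest.json")
link.parent.mkdir(parents=True, exist_ok=True)
if not link.exists():
    link.symlink_to(real_manifest)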

datasets.csv Metadata

The dataset metadata file allows you to declare information about your datasets to analyze, including the reference genome version and where to locate the files.

Please provide data in the form of multiple *.vcf files split per-contig.

Format example
dataset_name <str>
    The name of the dataset. This value will be used as a universal accessor for that dataset and any information relating to it. This means that any output files will use this value to determine filenames, etc. It is also used to computationally connect other metadata, e.g. sample-level information, to this dataset.
    E.g. 1000G
reference_genome <str>
    An enum indicating which reference genome version this dataset has been called on.
    E.g. GRCh37 or GRCh38
file <file_path>
    A file path indicating the location of the dataset to be used in the analysis.
    E.g. /nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz
| dataset_name | reference_genome | file |
| --- | --- | --- |
| HG002 | GRCh38 | /nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz |
| HG002 | GRCh38 | /nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz |
| HG002 | GRCh38 | /nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz |
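As a quick sanity check before launching an analysis, the dataset metadata can be inspected with pandas (assumed to be installed; the file name follows the convention used above):

import pandas as pd

datasets = pd.read_csv("datasets.csv")

# The three columns described above must be present.
required = {"dataset_name", "reference_genome", "file"}
missing = required - set(datasets.columns)
if missing:
    raise ValueError(f"datasets.csv is missing columns: {missing}")
print(datasets)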

samples.csv Metadata

The sample metadata file allows you to declare samples and provide the necessary sample-level information for use in this pipeline.

Format example

Case Sensitive

The following metadata declaration files use case-sensitive column names.

sample_name <str>
    The ID of the sample. This should correspond to the sample IDs provided in the supplied .vcf file.
    E.g. HG002
dataset <enum [dataset_name]>
    The name of the dataset this sample belongs to. This value should correspond to the dataset ID listed in datasets.csv.
    E.g. 1000G
* <str>
    Any additional columns are interpreted as cluster-level annotations (e.g. SUPER and SUB in the example below): the column name declares the cluster level and each value assigns the sample to a population within that level. Please note that the column names are case-sensitive.
    E.g. EUR or GBR
| sample_name | dataset | SUPER | SUB |
| --- | --- | --- | --- |
| HG002 | HG002 | EUR | GBR |
| HG002 | HG003 | AFR | GWD |
| HG002 | HG004 | SAS | GIH |
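Because the dataset column must reference datasets declared in datasets.csv, a simple cross-check can be sketched as follows (pandas assumed; any additional columns are treated as cluster levels):

import pandas as pd

datasets = pd.read_csv("datasets.csv")
samples = pd.read_csv("samples.csv")

# Every sample must belong to a dataset declared in datasets.csv.
unknown = set(samples["dataset"]) - set(datasets["dataset_name"])
if unknown:
    raise ValueError(f"samples.csv references undeclared datasets: {unknown}")

# Remaining columns (e.g. SUPER, SUB) are cluster-level annotations.
cluster_levels = [c for c in samples.columns if c not in {"sample_name", "dataset"}]
print("Cluster levels:", cluster_levels)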

locations.csv Metadata

The location metadata file allows you to declare genomic regions of interest and provide the necessary coordinate information for use in this pipeline.

Format example
location_name <str>
    The ID of a gene or, if not a studied gene region, a unique identifier to reference this genomic coordinate window.
    E.g. CYP2A6
chromosome <enum <int [0-24]>>
    The chromosome number on which the above genomic region can be found.
    E.g. 19
start <int>
    The start coordinate for the genomic window.
    E.g. 40842850
stop <int>
    The stop coordinate for the genomic window.
    E.g. 40851138
strand <enum [-1,1]>
    The strand on which the genomic region can be found, where 1 denotes the forward strand and -1 denotes the reverse strand.
    E.g. -1
| location_name | chromosome | start | stop | strand |
| --- | --- | --- | --- | --- |
| CYP2A6 | 19 | 40842850 | 40851138 | -1 |
| CYP2B6 | 19 | 40988570 | 41021110 | 1 |
| UGT2B7 | 4 | 69045214 | 69112987 | 1 |
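The coordinate windows can be checked and expressed as chromosome:start-stop region strings with a small sketch like the one below (pandas assumed; the region column is illustrative, not a workflow requirement):

import pandas as pd

locations = pd.read_csv("locations.csv")

# Each window should run from a lower start to a higher stop coordinate.
invalid = locations[locations["start"] >= locations["stop"]]
if not invalid.empty:
    raise ValueError(f"Invalid windows: {invalid['location_name'].tolist()}")

# Convenient region representation for downstream queries.
locations["region"] = (
    locations["chromosome"].astype(str)
    + ":"
    + locations["start"].astype(str)
    + "-"
    + locations["stop"].astype(str)
)
print(locations[["location_name", "region", "strand"]])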

transcripts.csv Metadata

The transcript metadata file allows you to declare which transcripts you would like to use when performing variant-effect-prediction.

During the execution of the Pharmacogenetics Analysis Workflow, variant-effect-prediction (VEP) is performed using a publicly accessible VEP query API provided by E! Ensembl. Currently, the API returns multiple VEP predictions based on any transcripts found to match the requested genomic location. Users can provide a transcripts.csv input file to declare, per genomic region, the list of transcripts they would like to consider for this analysis.

Transcript IDs

Please use transcripts listed on the E! Ensembl Database

Multiple Transcripts

If more than one transcript is provided for a given genomic region, we will attempt to match the provided transcripts in order, from top to bottom. The first successful match between the user's selection and the transcripts returned by E! Ensembl will be selected; if none of the provided transcripts are available, the first available transcript result will be used instead.
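The selection rule described above can be expressed as a short sketch (illustrative only; the function and variable names are not part of the workflow):

def select_transcript(preferred_ids, vep_transcript_ids):
    # Walk the user's preferences from top to bottom and return the
    # first transcript that also appears in the VEP response.
    for transcript_id in preferred_ids:
        if transcript_id in vep_transcript_ids:
            return transcript_id
    # If none of the preferred transcripts are available, fall back to
    # the first available transcript result.
    return vep_transcript_ids[0] if vep_transcript_ids else None

# Hypothetical VEP response for a CYP2A6 query:
print(select_transcript(
    ["NM_000762.6", "ENST00000600495.1"],
    ["ENST00000600495.1", "ENST00000596719.5"],
))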

Format example
gene_name <enum [str]>
    The name of the gene a transcript describes. This key should match the gene or region name provided in the locations.csv file.
    E.g. CYP2A6
transcript_id <str>
    The ID of the transcript in question. This value will be used to query the E! Ensembl database when performing variant-effect-prediction.
    E.g. NM_000762.6
| gene_name | transcript_id |
| --- | --- |
| CYP2A6 | NM_000762.6 |
| CYP2A6 | ENST00000600495.1 |
| CYP2A6 | ENST00000596719.5 |
| CYP2A6 | ENST00000599960.1 |
| CYP2B6 | NM_000767.5 |
| CYP2B6 | ENST00000593831.1 |
| CYP2B6 | ENST00000598834.2 |
| CYP2B6 | ENST00000597612.1 |
| CYP2B6 | ENST00000594187.1 |
| UGT2B7 | NM_001074.4 |
| UGT2B7 | ENST00000508661.5 |
| UGT2B7 | ENST00000622664.1 |
| UGT2B7 | ENST00000502942.5 |
| UGT2B7 | ENST00000509763.1 |


This work is licensed under a Creative Commons Attribution 4.0 International License. This project is managed by the Institute for Cellular and Molecular Medicine at the University of Pretoria.