Workflow

A summary of the workflow itself and its analyses, broken down by topic.

Reference Genome Configuration


Introduction

The Pharmacogenetics Analysis Pipeline is a pipeline powered by Snakemake, a python-based workflow management package. This project has been created with support for PBS-Torque scheduler environments on Linux servers.

Below is a diagram representing the pipeline flow and steps in the form of a process flow diagram. For reference on the graph syntax (Shape legend), please consult this guide.

---
title: Pharmacogenetics Analysis Pipeline
---
flowchart TD
    START([Start])
    END([End - Results])

    subgraph ValidateVcfModule
        validateVcf([Validate VCF\nWorkflow])
    end
    subgraph PopulationStructureModule
        PopulationStructureWorkflow([Populaltion Structure\nWorkflow])
    end


    subgraph dataPrep ["Data and Metadata Preparation"]
        %% Use LR to invert axis set by parent to effectively force relative "TB"
        direction LR
        genomeFasta[/"Reference Genome\nGRCh38 (FASTA)"/]
            subgraph data ["Datasets"]
                datasetFiles1[/"Dataset_1\n(.vcf.gz + .tbi)"/]
                datasetFiles2[/"Dataset_2\n(.vcf.gz + .tbi)"/]
                datasetFiles3[/"Dataset_n...\n(.vcf.gz + .tbi)"/]
            end
            subgraph metadata ["Analysis metadata"]
                %% Use LR to invert axis set by parent to effectively force relative "TB"
                direction TB

                datasetMeta[/"Datasets metadata\n(CSV)"/]
                locationMeta[/"Genomic location\nmetadata (CSV)"/]
                sampleMeta[/"Sample metadata\n(CSV)"/]
                transcriptMeta[/"Transcript metadata\n(CSV)"/]
            end

    end
    START --> dataPrep
    dataPrep --> validateVcf

    validateVcf --> ifMultipleVcfs

    subgraph prep [Process]
        direction BT

        subgraph multipleVcfProtocol [Multiple dataset protocol]
            mergeVcf[Merge VCFs]
        end


        ifMultipleVcfs{If multiple\ndatasets provided} --> |yes| multipleVcfProtocol
        ifMultipleVcfs --> |no| refFromFasta

        refFromFasta[[refFromfasta:\nCheck reference alleles against\nprovided reference genome]]
        
        multipleVcfProtocol --> refFromFasta

        refFromFasta --> chrFilter[[chrFilter:\nFilter out non-standard\nchromosomes]]

        chrFilter --> sampleSubset[[sampleSubset:\nSubset samples to labeled\nsamples in metadata files]]
        
        sampleSubset --> FILTER[[FILTER:\nFilter variants and samples\nwith 100% missingness & prune\noverrelated samples]]

        FILTER --> TRIM_AND_NAME[[TRIM_AND_NAME:\nTrim dataset to genomic\nregions of interest]]

        TRANSPILE_CLUSTERS[[TRANSPILE_CLUSTERS:\nTranspile cluster ownership\nfrom sample cluster assignment\ninto input format]]

        TRIM_AND_NAME & TRANSPILE_CLUSTERS --> PLINK[[PLINK:\nPerform frequency analysis]]
 
    end 
    FILTER --> PopulationStructureWorkflow
    PopulationStructureWorkflow --> END
    PLINK --> END

Table of contents


This work is licensed under a Creative Commons Attribution 4.0 International License.. This project is managed by the Institute for Cellular and Molecular Medicine at the University of Pretoria.