Methodology

A breakdown of the process used in this workflow and how it has been implemented.

Rule Map/Diagram
---
title: Population Structure Workflow
config:
    flowchart:
        defaultRenderer: elk
---
flowchart TB
  subgraph population_structure_workflow[Population Structure Workflow]
    direction TB
    classDef bcftools stroke:#FF5733,fill:#D3D3D3,stroke-width:4px,color:black;
    classDef plink stroke:#36454F,fill:#D3D3D3,stroke-width:4px,color:black;
    classDef python stroke:#FEBE10,fill:#D3D3D3,stroke-width:4px,color:black;
    classDef admixture stroke:#333,fill:#D3D3D3,stroke-width:4px,color:black;
    classDef tabix stroke:#023020,fill:#D3D3D3,stroke-width:4px,color:black;
    classDef gatk stroke:#007FFF,fill:#D3D3D3,stroke-width:4px,color:black;
    START(((Input)))
    END(((Output)))

    extract_provided_region[[<b>extract_provided_region</b>: <br>Extract the provided region <br>coordinates for clustering]]

    remove_rare_variants[[<b>remove_rare_variants</b>: <br>Remove all variants which are <br>not good indicators of population <br>structure by nature]]

    plink_pca[[<b>plink_pca</b>: <br>Perform a <br>PLINK-2.0 PCA]]
    
    report_fixation_index_per_cluster[[<b>report_fixation_index_per_cluster</b>: <br>Report Fixation-index for the <br>provided clusters]]

    class remove_rare_variants,plink_pca,plinkPed,report_fixation_index_per_cluster,extract_provided_region plink;
    class Admixture admixture;
    class fetchPedLables python;

    START --> extract_provided_region --> remove_rare_variants --> plink_pca & report_fixation_index_per_cluster

    plink_pca & report_fixation_index_per_cluster --> END
  end
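The dependencies drawn in the diagram above can also be written down as a small graph and sorted into a valid execution order. This is only an illustrative sketch: the rule names come from the diagram, but the `rule_dependencies` dictionary is a hypothetical structure and not part of the workflow's code.

```python
from graphlib import TopologicalSorter

# Each rule maps to the set of rules whose output it consumes
# (hypothetical representation of the diagram above).
rule_dependencies = {
    "extract_provided_region": set(),
    "remove_rare_variants": {"extract_provided_region"},
    "plink_pca": {"remove_rare_variants"},
    "report_fixation_index_per_cluster": {"remove_rare_variants"},
}

# static_order() yields every rule after all of its dependencies.
execution_order = list(TopologicalSorter(rule_dependencies).static_order())
print(execution_order)
```

The two final rules do not depend on each other, so a workflow manager is free to run them in parallel once remove_rare_variants has finished.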
extract_provided_region
  flowchart TD
    extract_provided_region[[<b>extract_provided_region</b>: <br>Extract the provided region <br>coordinates for clustering]]

    classDef plink stroke:#36454F,fill:#D3D3D3,stroke-width:4px,color:black;
    class extract_provided_region plink;
Function
Extract the requested coordinates to be used for population clustering, as provided in the sample.csv file.
Command
plink2 --threads {threads} --pfile {params.input} vzs --from-bp {params.fromBP} --to-bp {params.toBP} --chr {params.chr} --make-pgen vzs --out {params.output}
Parameters
--threads {threads}
Used to set the number of CPU threads used during this calculation.
--pfile {params.input} vzs
Used to provide PLINK with the file prefix of a PLINK 2 binary fileset (.pgen, .pvar and .psam files). The vzs modifier tells PLINK to expect a Zstandard-compressed .pvar file (.pvar.zst).
--from-bp {params.fromBP}
The base-pair position at which the extracted region starts.
--to-bp {params.toBP}
The base-pair position at which the extracted region ends.
--chr {params.chr}
The chromosome on which these coordinates are found.
--make-pgen vzs
Save output to a PLINK 2 binary fileset, with the .pvar file Zstandard-compressed.
--out {params.output}
Provide the file name and path for output creation.
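As a rough illustration of how the placeholders in the command above are filled in, the following sketch assembles the same argument list in Python. The function name, the file prefixes and the coordinates are all hypothetical; in the workflow itself these values are substituted by the workflow manager.

```python
import shlex


def build_extract_region_command(pfile_prefix, chrom, from_bp, to_bp, out_prefix, threads=1):
    """Assemble the plink2 argument list shown in the Command section."""
    if from_bp > to_bp:
        raise ValueError("from_bp must not exceed to_bp")
    return [
        "plink2",
        "--threads", str(threads),
        "--pfile", pfile_prefix, "vzs",
        "--from-bp", str(from_bp),
        "--to-bp", str(to_bp),
        "--chr", str(chrom),
        "--make-pgen", "vzs",
        "--out", out_prefix,
    ]


# Hypothetical values, for illustration only.
command = build_extract_region_command(
    "results/input", 19, 40000000, 41000000, "results/region", threads=4
)
print(shlex.join(command))
```

Keeping the arguments as a list rather than a single string avoids shell-quoting problems if a path ever contains spaces.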
remove_rare_variants
  flowchart TD
    remove_rare_variants[[<b>remove_rare_variants</b>: <br>Remove all variants which are <br>not good indicators of population <br>structure by nature]]

    classDef plink stroke:#36454F,fill:#D3D3D3,stroke-width:4px,color:black;
    class remove_rare_variants plink;
Function
Remove singletons (variants whose minor allele is observed only once). These do not contribute towards an understanding of clusters, since a singleton can only separate one sample from a possible cluster and never groups samples together.
Command
plink2 --threads {threads} --pfile {params.input} vzs --pheno {input.sample_metadata} --mac 2 --make-pgen vzs --out {params.output}
Parameters
--threads {threads}
Used to set the number of CPU threads used during this calculation.
--pfile {params.input} vzs
Used to provide PLINK with the file prefix of a PLINK 2 binary fileset (.pgen, .pvar and .psam files). The vzs modifier tells PLINK to expect a Zstandard-compressed .pvar file (.pvar.zst).
--pheno {input.sample_metadata}
Load per-sample phenotype values (such as cluster assignments) from the provided sample metadata file, so that samples are annotated in the output fileset.
--mac 2
Remove any variants with a minor allele count of less than 2.
--make-pgen vzs
Save output to a PLINK 2 binary fileset, with the .pvar file Zstandard-compressed.
--out {params.output}
Provide the file name and path for output creation.
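The effect of the --mac 2 filter can be illustrated on a toy dataset. The sketch below is a simplified stand-in, not PLINK's implementation: it assumes biallelic variants, diploid samples, and genotypes encoded as alternate-allele dosages, and the variant names are made up.

```python
def minor_allele_count(genotypes):
    """Return the minor allele count for one biallelic variant.

    `genotypes` holds alternate-allele dosages (0, 1 or 2) per diploid
    sample; None marks a missing call and is skipped.
    """
    called = [g for g in genotypes if g is not None]
    alt_count = sum(called)
    ref_count = 2 * len(called) - alt_count
    return min(alt_count, ref_count)


def filter_by_mac(variants, min_mac=2):
    """Keep only variants whose minor allele count is at least `min_mac`."""
    return {vid: g for vid, g in variants.items() if minor_allele_count(g) >= min_mac}


# Hypothetical toy genotypes for four samples.
variants = {
    "var_singleton": [0, 0, 1, 0],  # minor allele seen once
    "var_shared": [0, 1, 1, 0],     # minor allele seen in two samples
    "var_common": [2, 1, 1, None],  # one missing call is ignored
}
kept = filter_by_mac(variants)
print(sorted(kept))
```

Only the singleton is dropped here; the other two variants each have a minor allele count of 2 and are retained.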
plink_pca
  flowchart TD
    plink_pca[[<b>plink_pca</b>: <br>Perform a <br>PLINK-2.0 PCA]]

    classDef plink stroke:#36454F,fill:#D3D3D3,stroke-width:4px,color:black;
    class plink_pca plink;
Function
Perform dimensionality reduction (principal component analysis) on the provided samples, producing sample eigenvectors, eigenvalues and per-allele weights that can indicate possible population structure.
Command
plink2 --threads {threads} --pfile {params.input} vzs --pca allele-wts --out {params.output}
Parameters
--threads {threads}
Used to set the number of CPU threads used during this calculation.
--pfile {params.input} vzs
Used to provide PLINK with the file prefix of a PLINK 2 binary fileset (.pgen, .pvar and .psam files). The vzs modifier tells PLINK to expect a Zstandard-compressed .pvar file (.pvar.zst).
--pca allele-wts
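Run the PCA and write the eigenvector and eigenvalue files. The allele-wts modifier additionally requests a report of per-allele weights for each principal component.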
--out {params.output}
Provide the file name and path for output creation.
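One common follow-up to the PCA is to convert the reported eigenvalues into relative proportions. The sketch below assumes the eigenvalue file holds one value per line, and it uses made-up numbers in place of real output. Because only the top components are reported, the proportions are relative to the reported components and not to the total variance in the data.

```python
def variance_proportions(eigenval_text):
    """Return each eigenvalue's share of the sum of the reported eigenvalues."""
    values = [float(line) for line in eigenval_text.splitlines() if line.strip()]
    total = sum(values)
    return [value / total for value in values]


# Hypothetical eigenvalues, one per line, standing in for a real .eigenval file.
example_eigenval = "4.0\n3.0\n2.0\n1.0\n"
proportions = variance_proportions(example_eigenval)
for index, share in enumerate(proportions, start=1):
    print(f"PC{index}: {share:.1%}")
```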
report_fixation_index_per_cluster
  flowchart TD
    report_fixation_index_per_cluster[[<b>report_fixation_index_per_cluster</b>: <br>Report Fixation-index for the <br>provided clusters]]

    classDef plink stroke:#36454F,fill:#D3D3D3,stroke-width:4px,color:black;
    class report_fixation_index_per_cluster plink;
Function
Calculate the fixation index (Fst) between the provided clusters and report the results.
Command
plink2 --threads {threads} --pfile {params.input} vzs --fst {wildcards.cluster} report-variants zs --out {params.output}
Parameters
--threads {threads}
Used to set the number of CPU threads used during this calculation.
--pfile {params.input} vzs
Used to provide PLINK with the file prefix of a PLINK 2 binary fileset (.pgen, .pvar and .psam files). The vzs modifier tells PLINK to expect a Zstandard-compressed .pvar file (.pvar.zst).
--fst {wildcards.cluster} report-variants zs
Perform the requested fixation index calculations between the groups defined by the {wildcards.cluster} phenotype. The report-variants modifier requests variant-level Fst results and the zs modifier requests Zstandard-compressed output.
--out {params.output}
Provide the file name and path for output creation.
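To give a feel for what a per-variant fixation index measures, the sketch below computes a textbook form of the Hudson Fst estimator for one biallelic variant in two populations. It is an illustration under stated assumptions and is not claimed to reproduce PLINK's exact calculation: the function name is hypothetical, `p1` and `p2` are sample allele frequencies, and `n1` and `n2` are taken to be the number of alleles sampled in each population.

```python
def hudson_fst(p1, n1, p2, n2):
    """Textbook Hudson Fst estimate for one biallelic variant.

    p1, p2: sample allele frequencies in the two populations.
    n1, n2: number of sampled alleles in each population (assumed > 1).
    """
    numerator = (
        (p1 - p2) ** 2
        - p1 * (1 - p1) / (n1 - 1)
        - p2 * (1 - p2) / (n2 - 1)
    )
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    if denominator == 0:
        # The variant is fixed for the same allele in both populations.
        return float("nan")
    return numerator / denominator


# Hypothetical allele frequencies for two clusters of five diploid samples each.
fst = hudson_fst(p1=0.5, n1=10, p2=0.1, n2=10)
print(round(fst, 4))
```

Values near 0 indicate little differentiation between the two clusters at that variant, while values approaching 1 indicate strong differentiation. Slightly negative estimates can occur because of the sampling correction terms.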

This work is licensed under a Creative Commons Attribution 4.0 International License. This project is managed by the Institute for Cellular and Molecular Medicine at the University of Pretoria.