Methodology

A breakdown of the methodology and softwares used in this workflow.

plink2

plink

bcftools

java -jar $PICCARD

Table of contents
  1. VCF Validation Workflow Rules
  2. Pharmacogenetics Workflow Rules
  3. Population Structure Workflow Rules

Rule Map/Diagram
  ---
  title: Pharmacogenetics Analysis
  ---
  flowchart TD
  subgraph pharmacogeneticsWorkflow [Pharmacogenetics Workflow]
      direction BT

      reportFreq[[reportFreq:\nPerform frequency analysis]]
      filterRequestedSamples[[filterRequestedSamples:\nSubset samples to labeled\nsamples in metadata files]]
      filterVariantMissingness[[filterVariantMissingness:\nFilter variants with 100%\nmissingness]]
      filterSampleMissingness[[filterSampleMissingness:\nFilter samples with 100%\nmissingness]]
      refFromFasta[[refFromfasta:\nCheck reference alleles against\nprovided reference genome]]
      chrFilter[[chrFilter:\nFilter out non-standard\nchromosomes]]
      writeSampleMetadata[[writeSampleMetadata:\nTranspile cluster ownership from\nsample cluster assignment into\ninput format]]
      calculateLinkageDisequilibrium[[calculateLinkageDisequilibrium:\nCalculate LD associations]]
      filterLinkageDisequilibrium[[filterLinkageDisequilibrium:\nRemove variants in LD]]
      calculateIdentityByDescent[[calculateIdentityByDescent:\nCalculate Identity-By-Descent]]
      calculateSampleIds[[calculateSampleIds:\nQuery a list of sample IDs\nfrom the input VCF]]
      filterSampleRelatedness[[filterSampleRelatedness:\nremove a given list of]]
      filterLocations[[filterLocations:\nTrim the dataset to one of\nthe studied regions]]


      subgraph multipleVcfProtocol [Multiple dataset protocol]
        direction LR
        multipleVcfProtocolStart(((Start)))
        ifMergeRequired{Is a \nmerge needed?}
        mergeDatasets[[mergeDatasets:\nMerge multiple incoming\ndatasets]]
        multipleVcfProtocolEnd(((End)))

        multipleVcfProtocolStart --> ifMergeRequired
        ifMergeRequired --> |yes| mergeDatasets --> multipleVcfProtocolEnd
        ifMergeRequired --> |No| multipleVcfProtocolEnd
      end

      
      multipleVcfProtocol --> refFromFasta --> chrFilter --> filterRequestedSamples --> filterVariantMissingness --> filterSampleMissingness --> calculateLinkageDisequilibrium
      
      filterSampleMissingness & calculateLinkageDisequilibrium --> filterLinkageDisequilibrium

      filterLinkageDisequilibrium --> calculateIdentityByDescent --> calculateSampleIds

      filterLinkageDisequilibrium & calculateSampleIds --> filterSampleRelatedness

      filterSampleRelatedness --> filterLocations --> reportFreq
      
      

      writeSampleMetadata --> reportFreq

  end 
  subgraph ValidateVcfWorkflow [Validate VCF Workflow]
      wipeInfo[[wipeInfo:\nRemove INFO column for\ncomputational processing\n efficiency]]
      normalize[[normalize:\nNormalize all SNPs]]
      sort[[sort:\nEnsure correct variant order]]
      filter[[filter:\nRemove all variants\nexcept SNPs]]
      annotate[[annotate:\nAnnotate VCF against given\nreference VCF such as \n dbSNP, and rename any\nunknown variants.]]

      subgraph liftoverProtocol [Liftover]
        direction LR
        liftoverProtocolStart(((Start)))
        liftover[[liftover:\nPerform reference genome\nliftover]]
        liftoverProtocolEnd(((End)))
        ifLiftoverRequired{Is a\nliftover\nrequired?}

        liftoverProtocolStart --> ifLiftoverRequired
        ifLiftoverRequired --> |yes| liftover --> liftoverProtocolEnd
        ifLiftoverRequired --> |no| liftoverProtocolEnd
      end

      wipeInfo --> normalize --> sort --> filter --> annotate --> liftoverProtocol
  end
  subgraph PopulationStructureWorkflow [Population Structure Workflow]
      plinkPca[[Plink_PCA:\nPerform a PLINK-2.0 PCA]]
      plinkPed[[plinkPed:\nConvert to PLINK-1.9's PED\n format]]
      fetchPedLables[[fetchPedLables:\nGenerate Ind2Pop sample annotations\n file]]
      Admixture[[Admixture:\nPerform an admixture analysis]]

      plinkPed --> fetchPedLables --> Admixture
      filterSampleRelatedness --> plinkPca & plinkPed

  end

  liftoverProtocol --> multipleVcfProtocol

  END((Results))

  Admixture --> END
  plinkPca --> END
  reportFreq --> END
  calculateIdentityByDescent --> END
  calculateLinkageDisequilibrium --> END

VCF Validation Workflow Rules

wipeInfo
  flowchart TD
    wipeInfo[[wipeInfo:\nRemove INFO column for\ncomputational processing\n efficiency]]
Function
To remove the INFO and FORMAT columns on the incoming dataset. This is done to speed up computation time for downstream analysis.
Command
bcftools annotate -x INFO,FORMAT -Oz -o {output.vcf} {input.vcf}
Parameters
-x INFO,FORMAT
Remove the INFO and FORMAT annotations from the input VCF file.
-Oz
Output format (-Oz denotes a BG-Zipped VCF output)
-o {output.vcf}
Output file.
normalize
  flowchart TD
    normalize[[normalize:\nNormalize all SNPs]]
Function
To normalize variant representations within the dataset provided. This involves the following:
  • decomposing multi-allelic records
  • left-aligning all variants
  • right-handed trimming to ensure parsimony
Command
bcftools norm -m -any -O z -o {output.vcf} < {input.vcf}
Parameters
-m -any
Decompose multi-allelic entries to bi-allelic entries (-) and merge both SNPs and INDELS into single records (any)
-Oz
Output format (-Oz denotes a BG-Zipped VCF output)
-o {output.vcf}
Output file.
sort
  flowchart TD
    sort[[sort:\nEnsure correct variant order]]

This rule is responsible for sorting variants according to position, relative to the provided reference genome. This is important for downstream analysis which assumes ordered variants.

Function
To sort variants according to position, relative to the provided reference genome. This is important for downstream analysis which assumes ordered variants.
Command
bcftools sort -m {params.memory} -T results/PREP/{wildcards.dataset_name} -O z -o {output.vcf} < {input.vcf}
Parameters
-m {params.memory}
Provide a RAM memory available to the bcftools sort command.
-T results/PREP/{wildcards.dataset_name}
Provide a RAM memory available to the bcftools sort command.
-Oz
Output format (-Oz denotes a BG-Zipped VCF output)
-o {output.vcf}
Output file.
filter
  flowchart TD
    filter[[filter:\nRemove all variant types\nexcept SNPs]]
Function
To remove all variant types except SNPs
Command
bcftools view -v snps -f PASS -O z -o {output.vcf} < {input.vcf}
Parameters
-v snps
Only include SNPs
-f PASS
Only select variants with PASS values.
-Oz
Output format (-Oz denotes a BG-Zipped VCF output)
-o {output.vcf}
Output file.
annotate
  flowchart TD
    annotate[[annotate:\nAnnotate VCF against given\nreference VCF such as \n dbSNP, and rename any unknown\nvariants.]]
Function
To annotate the incoming data with variant IDs from the provided resources/annotations.vcf.gz, and rename any unknown variants.
Command
bcftools annotate -c ID -a {input.annotations} -O z -o {output.vcf} {input.vcf}
Parameters
-a {input.annotations}
The VCF file that contains the desired annotations.
-c ID
Copy the ID column from the provided annotation VCF.
-I +'%CHROM:%POS|%REF-%FIRST_ALT'
Name all variants using the provided formula. The + indicates that this renaming logic should only be applied to variants which have no name, and is applied after retrieving annotations from the provided annotations VCF. %CHROM denotes the chromosome, %POS denotes the base-pair position of this variant, %REF denotes the reference allele at this location, and %FIRST_ALT denotes the first allele. Since this VCF file has been normalized and multi-allelic variants have already been decomposed to bi-allelic records, this will correspond to the only available allele for a loci.
-O z
Output format (-Oz denotes a BG-Zipped VCF output)
-o {output.vcf}
Output file.
annotateUnknown
  flowchart TD
    annotateUnknown[[annotateUnknown:\nName all un-annotated variants using \nstandardized naming conventions.]]
Function
To name all un-named variants which did not have a matching annotation ID.
Command
plink --vcf {input.vcf} --set-missing-var-ids @:#\$1-\$2 --new-id-max-allele-len 200 --out {params.output}
Parameters
--vcf {input.vcf}
File path to the input VCF file via parameter.
--set-missing-var-ids @:#\$1-\$2
A string which describes a naming scheme to be used when setting all un-named variants @ denotes the chromosome code, # denotes the base-pair coordinates, $1 denotes the reference allele and $2 denotes the alternate allele.
--new-id-max-allele-len 200
Sets a maximum allowed variant ID length.
--out {params.output}
Provide the file name and path for output creation.
liftover*
  flowchart TD
    subgraph liftoverProtocol [Liftover]
      direction LR
      liftoverProtocolStart(((Start)))
      liftover[[liftover:\nPerform reference genome\nliftover]]
      liftoverProtocolEnd(((End)))
      ifLiftoverRequired{Is a\nliftover\nrequired?}

      liftoverProtocolStart --> ifLiftoverRequired
      ifLiftoverRequired --> |yes| liftover --> liftoverProtocolEnd
      ifLiftoverRequired --> |no| liftoverProtocolEnd
    end
Function
To perform reference-genome version liftovers.
Command
java -jar $PICARD LiftoverVcf I={input.vcf} O={output.vcf} R={params.ref} C={params.chainFile} REJECT={output.rejected}
Parameters
I={input.vcf}
Provide the input VCF via parameter.
O={output.vcf}
Provide the output VCF to be written to via parameter.
R={params.ref}
Provide the reference genome to be used during LiftOver
C={params.chainFile}
Provide the chain-file describing the nature of the changes between two reference genome versions.
REJECT={params.chainFile}
Creates a file containing records which could not be lifted over.

Pharmacogenetics Workflow Rules

mergeDatasets*
  flowchart TD
    subgraph multipleVcfProtocol [Multiple dataset protocol]
      direction LR
      multipleVcfProtocolStart(((Start)))
      ifMergeRequired{Is a \nmerge needed?}
      mergeDatasets[[mergeDatasets:\nMerge multiple incoming\ndatasets]]
      multipleVcfProtocolEnd(((End)))

      multipleVcfProtocolStart --> ifMergeRequired
      ifMergeRequired --> |yes| mergeDatasets --> multipleVcfProtocolEnd
      ifMergeRequired --> |No| multipleVcfProtocolEnd
    end

This rule only executes when multiple described datasets are detected. This rule is responsible for merging multiple datasets into a single VCF file, suitable for collective analysis.

Function
To perform reference-genome version liftovers.
Command
bcftools merge -O z -o {output} {input.vcf}
Parameters
-O z
Output format (-Oz denotes a BG-Zipped VCF output)
-o {output.vcf}
Output file.
refFromFasta
  flowchart TD
    refFromFasta[[refFromfasta:\nCheck reference alleles against\nprovided reference genome]]
Function
To check each loci and comparing its listed reference to that provided in the reference genome.
Command
plink2 --vcf {input.vcf} --fa {params.ref} --ref-from-fa force --allow-extra-chr --export vcf-4.2 bgz --out results/COLLATE/refFromFasta
Parameters
--vcf {input.vcf}
File path to the input VCF file via parameter.
--fa {params.ref}
File path to reference genome to be used for comparison.
--ref-from-fa force
Sets REF allele to provided reference FASTA when possible unambiguously (Does not apply to some INDELS)
--allow-extra-chr
Permits non-standard chromosome codes in input data
--export vcf-4.2 bgz
Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out results/COLLATE/refFromFasta
Provide the file name and path for output creation.
chrFilter
  flowchart TD
    chrFilter[[chrFilter:\nFilter out non-standard\nchromosomes]]
Function
To filter out non-standard chromosomes.
Command
plink2 --vcf {input.vcf} --allow-extra-chr --output-chr chr26 --chr 1-26 --export vcf-4.2 bgz --out results/COLLATE/chrFilter
Parameters
--vcf {input.vcf}
File path to the input VCF file via parameter.
--allow-extra-chr
Permits non-standard chromosome codes in input data.
--output-chr chr26
Sets chromosome code notation in output files to include the 'chr' as a prefix.
--chr 1-26
Request a subset of chromosomes to be included in the output file.
--export vcf-4.2 bgz
Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out results/COLLATE/chrFilter
Provide the file name and path for output creation.
filterRequestedSamples
  flowchart TD
    filterRequestedSamples[[filterRequestedSamples:\nSubset samples to labeled\nsamples in metadata files]]
Function
To remove unneeded samples. This is done by comparison against all provided sample annotations in the input/samples.csv metadata file.
Command
bcftools view -s {params.samples} -O z -o {output} {input.vcf}
Parameters
-s {params.samples}
Provide a list of sample IDs to include in output.
-O z
Output format (-Oz denotes a BG-Zipped VCF output)
-o {output}
Output file.
filterVariantMissingness
  flowchart TD
    filterVariantMissingness[[filterVariantMissingness:\nFilter variants with 100%\nmissingness]]
Function
To manage and remove regions of missing calls along the variant-level.
Command
plink2 --chr 1-26 --allow-extra-chr --vcf {input} --geno 1.0 --output-chr chr26 --export vcf-4.2 bgz --out {params.output}
Parameters
--chr 1-26
Request a subset of chromosomes to be included in the output file.
--allow-extra-chr
Permits non-standard chromosome codes in input data.
--vcf {input.vcf}
Removes all variants with a missing call rate exceeding 1.0
--geno 1.0
File path to the input VCF file via parameter.
--output-chr chr26
Sets chromosome code notation in output files to include the 'chr' as a prefix.
--export vcf-4.2 bgz
Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out {params.output}
Provide the file name and path for output creation.
filterSampleMissingness
  flowchart TD
    filterSampleMissingness[[filterSampleMissingness:\nFilter samples with 100%\nmissingness]]
Function
To manage and remove regions of missing calls along the sample-level.
Command
plink2 --chr 1-26 --allow-extra-chr --vcf {input} --mind 1.0 --output-chr chr26 --export vcf-4.2 bgz --out {params.output}
Parameters
--chr 1-26
Request a subset of chromosomes to be included in the output file.
--allow-extra-chr
Permits non-standard chromosome codes in input data.
--vcf {input.vcf}
File path to the input VCF file via parameter.
--mind 1.0
Removes all samples with a missing call rate exceeding 1.0
--output-chr chr26
Sets chromosome code notation in output files to include the 'chr' as a prefix.
--export vcf-4.2 bgz
Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out {params.output}
Provide the file name and path for output creation.
calculateLinkageDisequilibrium
  flowchart TD
    calculateLinkageDisequilibrium[[calculateLinkageDisequilibrium:\nCalculate LD associations]]
Function
To calculate and compile a Linkage-Disequilibrium report.
Command
plink2 --vcf {input} --chr 1-26 --new-id-max-allele-len 1000 --rm-dup exclude-mismatch --indep-pairwise 50 5 0.5 --bad-ld --out {params.output}
Parameters
--vcf {input.vcf}
File path to the input VCF file via parameter.
--chr 1-26
Request a subset of chromosomes to be included in the output file.
--new-id-max-allele-len 1000
Sets a new internal maximum length for variant IDs.
--rm-dup exclude-mismatch
When duplicate IDs are found, remove all entries.
--indep-pairwise 50 5 0.5
Only include variants in approximate linkage equilibrium, using a 50-variant window which moves 5 variants per step and removes variants with an r2 value greater than 0.5.
--bad-ld
Overrides warning for less than 50 sample datasets where LD is not accurate.
--out {params.output}
Provide the file name and path for output creation.
filterLinkageDisequilibrium
  flowchart TD
    filterLinkageDisequilibrium[[filterLinkageDisequilibrium:\nRemove variants in LD]]
Function
To calculate and compile a Linkage-Disequilibrium report.
Command
plink2 --allow-extra-chr --vcf {input.vcf} --extract {input.inclusion_list} --export vcf-4.2 bgz --out {params.output}
Parameters
--allow-extra-chr
Permits non-standard chromosome codes in input data.
--vcf {input.vcf}
File path to the input VCF file via parameter.
--extract {input.inclusion_list}
Extracts only the listed samples.
--export vcf-4.2 bgz
Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out {params.output}
Provide the file name and path for output creation.
calculateIdentityByDescent
  flowchart TD
    calculateIdentityByDescent[[calculateIdentityByDescent:\nCalculate Identity-By-Descent]]
Function
To calculate and compile am Identity-By-Descent report.
Command
plink --vcf {input} --allow-extra-chr --keep-allele-order --genome --min 0.2 --recode vcf-iid bgz --out {params.output}
Parameters
--vcf {input.vcf}
File path to the input VCF file via parameter.
--allow-extra-chr
Permits non-standard chromosome codes in input data.
--keep-allele-order
Tells Plink-1.9 to maintain original A1/A2, otherwise major allele is set to A2.
--genome
Run an identity-by-descent report.
--min 0.2
Removes individuals with an IBD score below 0.2.
--recode vcf-iid bgz
Sets output format to a BG-ZIpped VCF with individual-identifiers.
--out {params.output}
Provide the file name and path for output creation.
calculateSampleIds
  flowchart TD
    calculateSampleIds[[calculateSampleIds:\nQuery a list of sample IDs\nfrom the input VCF]]
Function
To generate a list of sample IDs.
Command
bcftools query -l {input} > {output}
Parameters
-l {input}
A file containing a list of Sample IDs to keep.
filterSampleRelatedness
  flowchart TD
    filterSampleRelatedness[[filterSampleRelatedness:\nremove a given list of samples\nbased on IBD results]]
Function
To filter out all but unrelated samples, given the list of samples to keep from its predecessor rules.
Command
bcftools view {input.vcf} -S {input.samples} -O z -o {output}
Parameters
-S {input.vcf}
A file containing a list of Sample IDs to include.
-O z
Output format (-Oz denotes a BG-Zipped VCF output)
-o {output}
Output file.
filterLocations
  flowchart TD
    filterLocations[[filterLocations:\nTrim the dataset to one of\nthe studied regions]]
Function
To filter out all but unrelated samples, given the list of samples to keep from its predecessor rules.
Command
plink2 --allow-extra-chr --vcf {input} --from-bp {params.fromBP} --to-bp {params.toBP} --chr {params.chr} --output-chr chr26 --export vcf-4.2 bgz --out results/TRIM/ALL_{wildcards.location}
Parameters
--allow-extra-chr
Permits non-standard chromosome codes in input data.
--vcf {input.vcf}
File path to the input VCF file via parameter.
--from-bp {params.fromBP}
Start coordinates for the region to trim to.
--to-bp {params.toBP}
Stop coordinates for the region to trim to.
--chr {params.chr}
The chromosome to trim on.
--output-chr chr26
Sets chromosome code notation in output files to include the 'chr' as a prefix.
--export vcf-4.2 bgz
Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out results/TRIM/ALL_{wildcards.location}
Provide the file name and path for output creation.
writeSampleMetadata
  flowchart TD
    writeSampleMetadata[[writeSampleMetadata:\nTranspile cluster ownership from\nsample cluster assignment into\ninput format]]
Function
To compile sample metadata.
Command
join("scripts", "01-TRANSPILE_CLUSTERS.py")
Parameters
input/samples.csv
The sample information provided that should be used in this analysis.
reportFreq
  flowchart TD
    reportFreq[[reportFreq:\nPerform frequency analysis]]
Function
To generate a frequency report.
Command
plink2 --allow-extra-chr --vcf {input.vcf} --freq counts --export vcf-4.2 bgz --out results/FINAL/$CLUSTER/{params.prefix}, and plink2 --allow-extra-chr --vcf {input.vcf} --pheno iid-only results/REFERENCE/cluster_$CLUSTER.txt --loop-cats $CLUSTER --freq counts --missing --hardy midp --out results/FINAL/$CLUSTER/{params.prefix}
Parameters
--allow-extra-chr
Permits non-standard chromosome codes in input data.
--vcf {input.vcf}
File path to the input VCF file via parameter.
--freq counts
Generate a frequency report including variant count data.
--export vcf-4.2 bgz
Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out results/TRIM/ALL_{wildcards.location}
Provide the file name and path for output creation.
--pheno iid-only results/REFERENCE/cluster_$CLUSTER.txt
Extract phenotype information from the given file by Sample ID.
--loop-cats $CLUSTER
Re-run this command and focus on the populated cluster (This command is designed to be executed in a bash for loop, where $CLUSTER is set each iteration).
--missing
Generate a missingness report for both samples and variants.
--hardy midp
Generate a Hardy-Weinburg report with Mid-P adjustments.

Population Structure Workflow Rules

plinkPed
  flowchart TD
    plinkPed[[plinkPed:\nConvert to PLINK-1.9's PED\n format]]
fetchPedLables
  flowchart TD
    fetchPedLables[[fetchPedLables:\nGenerate Ind2Pop sample annotations\n file]]
Admixture
  flowchart TD
    Admixture[[Admixture:\nPerform an admixture analysis]]
plinkPca
  flowchart TD
    plinkPca[[Plink_PCA:\nPerform a PLINK-2.0 PCA]]

This work is licensed under a Creative Commons Attribution 4.0 International License.. This project is managed by the Institute for Cellular and Molecular Medicine at the University of Pretoria.