Methodology

A breakdown of the methodology and softwares used in this workflow.

plink2

plink

bcftools

java -jar $PICCARD

Table of contents

VCF Validation Workflow Rules
Pharmacogenetics Workflow Rules
Population Structure Workflow Rules

Rule Map/Diagram

  ---
  title: Pharmacogenetics Analysis
  ---
  flowchart TD
  subgraph pharmacogeneticsWorkflow [Pharmacogenetics Workflow]
      direction BT

      reportFreq[[reportFreq:\nPerform frequency analysis]]
      filterRequestedSamples[[filterRequestedSamples:\nSubset samples to labeled\nsamples in metadata files]]
      filterVariantMissingness[[filterVariantMissingness:\nFilter variants with 100%\nmissingness]]
      filterSampleMissingness[[filterSampleMissingness:\nFilter samples with 100%\nmissingness]]
      refFromFasta[[refFromfasta:\nCheck reference alleles against\nprovided reference genome]]
      chrFilter[[chrFilter:\nFilter out non-standard\nchromosomes]]
      writeSampleMetadata[[writeSampleMetadata:\nTranspile cluster ownership from\nsample cluster assignment into\ninput format]]
      calculateLinkageDisequilibrium[[calculateLinkageDisequilibrium:\nCalculate LD associations]]
      filterLinkageDisequilibrium[[filterLinkageDisequilibrium:\nRemove variants in LD]]
      calculateIdentityByDescent[[calculateIdentityByDescent:\nCalculate Identity-By-Descent]]
      calculateSampleIds[[calculateSampleIds:\nQuery a list of sample IDs\nfrom the input VCF]]
      filterSampleRelatedness[[filterSampleRelatedness:\nremove a given list of]]
      filterLocations[[filterLocations:\nTrim the dataset to one of\nthe studied regions]]


      subgraph multipleVcfProtocol [Multiple dataset protocol]
        direction LR
        multipleVcfProtocolStart(((Start)))
        ifMergeRequired{Is a \nmerge needed?}
        mergeDatasets[[mergeDatasets:\nMerge multiple incoming\ndatasets]]
        multipleVcfProtocolEnd(((End)))

        multipleVcfProtocolStart --> ifMergeRequired
        ifMergeRequired --> |yes| mergeDatasets --> multipleVcfProtocolEnd
        ifMergeRequired --> |No| multipleVcfProtocolEnd
      end

      
      multipleVcfProtocol --> refFromFasta --> chrFilter --> filterRequestedSamples --> filterVariantMissingness --> filterSampleMissingness --> calculateLinkageDisequilibrium
      
      filterSampleMissingness & calculateLinkageDisequilibrium --> filterLinkageDisequilibrium

      filterLinkageDisequilibrium --> calculateIdentityByDescent --> calculateSampleIds

      filterLinkageDisequilibrium & calculateSampleIds --> filterSampleRelatedness

      filterSampleRelatedness --> filterLocations --> reportFreq
      
      

      writeSampleMetadata --> reportFreq

  end 
  subgraph ValidateVcfWorkflow [Validate VCF Workflow]
      wipeInfo[[wipeInfo:\nRemove INFO column for\ncomputational processing\n efficiency]]
      normalize[[normalize:\nNormalize all SNPs]]
      sort[[sort:\nEnsure correct variant order]]
      filter[[filter:\nRemove all variants\nexcept SNPs]]
      annotate[[annotate:\nAnnotate VCF against given\nreference VCF such as \n dbSNP, and rename any\nunknown variants.]]

      subgraph liftoverProtocol [Liftover]
        direction LR
        liftoverProtocolStart(((Start)))
        liftover[[liftover:\nPerform reference genome\nliftover]]
        liftoverProtocolEnd(((End)))
        ifLiftoverRequired{Is a\nliftover\nrequired?}

        liftoverProtocolStart --> ifLiftoverRequired
        ifLiftoverRequired --> |yes| liftover --> liftoverProtocolEnd
        ifLiftoverRequired --> |no| liftoverProtocolEnd
      end

      wipeInfo --> normalize --> sort --> filter --> annotate --> liftoverProtocol
  end
  subgraph PopulationStructureWorkflow [Population Structure Workflow]
      plinkPca[[Plink_PCA:\nPerform a PLINK-2.0 PCA]]
      plinkPed[[plinkPed:\nConvert to PLINK-1.9's PED\n format]]
      fetchPedLables[[fetchPedLables:\nGenerate Ind2Pop sample annotations\n file]]
      Admixture[[Admixture:\nPerform an admixture analysis]]

      plinkPed --> fetchPedLables --> Admixture
      filterSampleRelatedness --> plinkPca & plinkPed

  end

  liftoverProtocol --> multipleVcfProtocol

  END((Results))

  Admixture --> END
  plinkPca --> END
  reportFreq --> END
  calculateIdentityByDescent --> END
  calculateLinkageDisequilibrium --> END

VCF Validation Workflow Rules

wipeInfo

  flowchart TD
    wipeInfo[[wipeInfo:\nRemove INFO column for\ncomputational processing\n efficiency]]

Function

To remove the INFO and FORMAT columns on the incoming dataset. This is done to speed up computation time for downstream analysis.

Command

bcftools annotate -x INFO,FORMAT -Oz -o {output.vcf} {input.vcf}

Parameters

-x INFO,FORMAT: Remove the INFO and FORMAT annotations from the input VCF file.
-Oz: Output format (-Oz denotes a BG-Zipped VCF output)
-o {output.vcf}: Output file.

normalize

  flowchart TD
    normalize[[normalize:\nNormalize all SNPs]]

Function

To normalize variant representations within the dataset provided. This involves the following:

decomposing multi-allelic records
left-aligning all variants
right-handed trimming to ensure parsimony

Command

bcftools norm -m -any -O z -o {output.vcf} < {input.vcf}

Parameters

-m -any: Decompose multi-allelic entries to bi-allelic entries (-) and merge both SNPs and INDELS into single records (any)
-Oz: Output format (-Oz denotes a BG-Zipped VCF output)
-o {output.vcf}: Output file.

sort

  flowchart TD
    sort[[sort:\nEnsure correct variant order]]

This rule is responsible for sorting variants according to position, relative to the provided reference genome. This is important for downstream analysis which assumes ordered variants.

Function

To sort variants according to position, relative to the provided reference genome. This is important for downstream analysis which assumes ordered variants.

Command

bcftools sort -m {params.memory} -T results/PREP/{wildcards.dataset_name} -O z -o {output.vcf} < {input.vcf}

Parameters

-m {params.memory}: Provide a RAM memory available to the bcftools sort command.
-T results/PREP/{wildcards.dataset_name}: Provide a RAM memory available to the bcftools sort command.
-Oz: Output format (-Oz denotes a BG-Zipped VCF output)
-o {output.vcf}: Output file.

filter

  flowchart TD
    filter[[filter:\nRemove all variant types\nexcept SNPs]]

Function

To remove all variant types except SNPs

Command

bcftools view -v snps -f PASS -O z -o {output.vcf} < {input.vcf}

Parameters

-v snps: Only include SNPs
-f PASS: Only select variants with PASS values.
-Oz: Output format (-Oz denotes a BG-Zipped VCF output)
-o {output.vcf}: Output file.

annotate

  flowchart TD
    annotate[[annotate:\nAnnotate VCF against given\nreference VCF such as \n dbSNP, and rename any unknown\nvariants.]]

Function

To annotate the incoming data with variant IDs from the provided resources/annotations.vcf.gz, and rename any unknown variants.

Command

bcftools annotate -c ID -a {input.annotations} -O z -o {output.vcf} {input.vcf}

Parameters

-a {input.annotations}: The VCF file that contains the desired annotations.
-c ID: Copy the ID column from the provided annotation VCF.
-I +'%CHROM:%POS|%REF-%FIRST_ALT': Name all variants using the provided formula. The + indicates that this renaming logic should only be applied to variants which have no name, and is applied after retrieving annotations from the provided annotations VCF. %CHROM denotes the chromosome, %POS denotes the base-pair position of this variant, %REF denotes the reference allele at this location, and %FIRST_ALT denotes the first allele. Since this VCF file has been normalized and multi-allelic variants have already been decomposed to bi-allelic records, this will correspond to the only available allele for a loci.
-O z: Output format (-Oz denotes a BG-Zipped VCF output)
-o {output.vcf}: Output file.

annotateUnknown

  flowchart TD
    annotateUnknown[[annotateUnknown:\nName all un-annotated variants using \nstandardized naming conventions.]]

Function

To name all un-named variants which did not have a matching annotation ID.

Command

plink --vcf {input.vcf} --set-missing-var-ids @:#\$1-\$2 --new-id-max-allele-len 200 --out {params.output}

Parameters

--vcf {input.vcf}: File path to the input VCF file via parameter.
--set-missing-var-ids @:#\$1-\$2: A string which describes a naming scheme to be used when setting all un-named variants @ denotes the chromosome code, # denotes the base-pair coordinates, $1 denotes the reference allele and $2 denotes the alternate allele.
--new-id-max-allele-len 200: Sets a maximum allowed variant ID length.
--out {params.output}: Provide the file name and path for output creation.

liftover*

  flowchart TD
    subgraph liftoverProtocol [Liftover]
      direction LR
      liftoverProtocolStart(((Start)))
      liftover[[liftover:\nPerform reference genome\nliftover]]
      liftoverProtocolEnd(((End)))
      ifLiftoverRequired{Is a\nliftover\nrequired?}

      liftoverProtocolStart --> ifLiftoverRequired
      ifLiftoverRequired --> |yes| liftover --> liftoverProtocolEnd
      ifLiftoverRequired --> |no| liftoverProtocolEnd
    end

Function

To perform reference-genome version liftovers.

Command

java -jar $PICARD LiftoverVcf I={input.vcf} O={output.vcf} R={params.ref} C={params.chainFile} REJECT={output.rejected}

Parameters

I={input.vcf}: Provide the input VCF via parameter.
O={output.vcf}: Provide the output VCF to be written to via parameter.
R={params.ref}: Provide the reference genome to be used during LiftOver
C={params.chainFile}: Provide the chain-file describing the nature of the changes between two reference genome versions.
REJECT={params.chainFile}: Creates a file containing records which could not be lifted over.

Pharmacogenetics Workflow Rules

mergeDatasets*

  flowchart TD
    subgraph multipleVcfProtocol [Multiple dataset protocol]
      direction LR
      multipleVcfProtocolStart(((Start)))
      ifMergeRequired{Is a \nmerge needed?}
      mergeDatasets[[mergeDatasets:\nMerge multiple incoming\ndatasets]]
      multipleVcfProtocolEnd(((End)))

      multipleVcfProtocolStart --> ifMergeRequired
      ifMergeRequired --> |yes| mergeDatasets --> multipleVcfProtocolEnd
      ifMergeRequired --> |No| multipleVcfProtocolEnd
    end

This rule only executes when multiple described datasets are detected. This rule is responsible for merging multiple datasets into a single VCF file, suitable for collective analysis.

Function

To perform reference-genome version liftovers.

Command

bcftools merge -O z -o {output} {input.vcf}

Parameters

-O z: Output format (-Oz denotes a BG-Zipped VCF output)
-o {output.vcf}: Output file.

refFromFasta

  flowchart TD
    refFromFasta[[refFromfasta:\nCheck reference alleles against\nprovided reference genome]]

Function

To check each loci and comparing its listed reference to that provided in the reference genome.

Command

plink2 --vcf {input.vcf} --fa {params.ref} --ref-from-fa force --allow-extra-chr --export vcf-4.2 bgz --out results/COLLATE/refFromFasta

Parameters

--vcf {input.vcf}: File path to the input VCF file via parameter.
--fa {params.ref}: File path to reference genome to be used for comparison.
--ref-from-fa force: Sets REF allele to provided reference FASTA when possible unambiguously (Does not apply to some INDELS)
--allow-extra-chr: Permits non-standard chromosome codes in input data
--export vcf-4.2 bgz: Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out results/COLLATE/refFromFasta: Provide the file name and path for output creation.

chrFilter

  flowchart TD
    chrFilter[[chrFilter:\nFilter out non-standard\nchromosomes]]

Function

To filter out non-standard chromosomes.

Command

plink2 --vcf {input.vcf} --allow-extra-chr --output-chr chr26 --chr 1-26 --export vcf-4.2 bgz --out results/COLLATE/chrFilter

Parameters

--vcf {input.vcf}: File path to the input VCF file via parameter.
--allow-extra-chr: Permits non-standard chromosome codes in input data.
--output-chr chr26: Sets chromosome code notation in output files to include the 'chr' as a prefix.
--chr 1-26: Request a subset of chromosomes to be included in the output file.
--export vcf-4.2 bgz: Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out results/COLLATE/chrFilter: Provide the file name and path for output creation.

filterRequestedSamples

  flowchart TD
    filterRequestedSamples[[filterRequestedSamples:\nSubset samples to labeled\nsamples in metadata files]]

Function

To remove unneeded samples. This is done by comparison against all provided sample annotations in the input/samples.csv metadata file.

Command

bcftools view -s {params.samples} -O z -o {output} {input.vcf}

Parameters

-s {params.samples}: Provide a list of sample IDs to include in output.
-O z: Output format (-Oz denotes a BG-Zipped VCF output)
-o {output}: Output file.

filterVariantMissingness

  flowchart TD
    filterVariantMissingness[[filterVariantMissingness:\nFilter variants with 100%\nmissingness]]

Function

To manage and remove regions of missing calls along the variant-level.

Command

plink2 --chr 1-26 --allow-extra-chr --vcf {input} --geno 1.0 --output-chr chr26 --export vcf-4.2 bgz --out {params.output}

Parameters

--chr 1-26: Request a subset of chromosomes to be included in the output file.
--allow-extra-chr: Permits non-standard chromosome codes in input data.
--vcf {input.vcf}: Removes all variants with a missing call rate exceeding 1.0
--geno 1.0: File path to the input VCF file via parameter.
--output-chr chr26: Sets chromosome code notation in output files to include the 'chr' as a prefix.
--export vcf-4.2 bgz: Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out {params.output}: Provide the file name and path for output creation.

filterSampleMissingness

  flowchart TD
    filterSampleMissingness[[filterSampleMissingness:\nFilter samples with 100%\nmissingness]]

Function

To manage and remove regions of missing calls along the sample-level.

Command

plink2 --chr 1-26 --allow-extra-chr --vcf {input} --mind 1.0 --output-chr chr26 --export vcf-4.2 bgz --out {params.output}

Parameters

--chr 1-26: Request a subset of chromosomes to be included in the output file.
--allow-extra-chr: Permits non-standard chromosome codes in input data.
--vcf {input.vcf}: File path to the input VCF file via parameter.
--mind 1.0: Removes all samples with a missing call rate exceeding 1.0
--output-chr chr26: Sets chromosome code notation in output files to include the 'chr' as a prefix.
--export vcf-4.2 bgz: Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out {params.output}: Provide the file name and path for output creation.

calculateLinkageDisequilibrium

  flowchart TD
    calculateLinkageDisequilibrium[[calculateLinkageDisequilibrium:\nCalculate LD associations]]

Function

To calculate and compile a Linkage-Disequilibrium report.

Command

plink2 --vcf {input} --chr 1-26 --new-id-max-allele-len 1000 --rm-dup exclude-mismatch --indep-pairwise 50 5 0.5 --bad-ld --out {params.output}

Parameters

--vcf {input.vcf}: File path to the input VCF file via parameter.
--chr 1-26: Request a subset of chromosomes to be included in the output file.
--new-id-max-allele-len 1000: Sets a new internal maximum length for variant IDs.
--rm-dup exclude-mismatch: When duplicate IDs are found, remove all entries.
--indep-pairwise 50 5 0.5: Only include variants in approximate linkage equilibrium, using a 50-variant window which moves 5 variants per step and removes variants with an r₂ value greater than 0.5.
--bad-ld: Overrides warning for less than 50 sample datasets where LD is not accurate.
--out {params.output}: Provide the file name and path for output creation.

filterLinkageDisequilibrium

  flowchart TD
    filterLinkageDisequilibrium[[filterLinkageDisequilibrium:\nRemove variants in LD]]

Function

To calculate and compile a Linkage-Disequilibrium report.

Command

plink2 --allow-extra-chr --vcf {input.vcf} --extract {input.inclusion_list} --export vcf-4.2 bgz --out {params.output}

Parameters

--allow-extra-chr: Permits non-standard chromosome codes in input data.
--vcf {input.vcf}: File path to the input VCF file via parameter.
--extract {input.inclusion_list}: Extracts only the listed samples.
--export vcf-4.2 bgz: Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out {params.output}: Provide the file name and path for output creation.

calculateIdentityByDescent

  flowchart TD
    calculateIdentityByDescent[[calculateIdentityByDescent:\nCalculate Identity-By-Descent]]

Function

To calculate and compile am Identity-By-Descent report.

Command

plink --vcf {input} --allow-extra-chr --keep-allele-order --genome --min 0.2 --recode vcf-iid bgz --out {params.output}

Parameters

--vcf {input.vcf}: File path to the input VCF file via parameter.
--allow-extra-chr: Permits non-standard chromosome codes in input data.
--keep-allele-order: Tells Plink-1.9 to maintain original A1/A2, otherwise major allele is set to A2.
--genome: Run an identity-by-descent report.
--min 0.2: Removes individuals with an IBD score below 0.2.
--recode vcf-iid bgz: Sets output format to a BG-ZIpped VCF with individual-identifiers.
--out {params.output}: Provide the file name and path for output creation.

calculateSampleIds

  flowchart TD
    calculateSampleIds[[calculateSampleIds:\nQuery a list of sample IDs\nfrom the input VCF]]

Function

To generate a list of sample IDs.

Command

bcftools query -l {input} > {output}

Parameters

-l {input}: A file containing a list of Sample IDs to keep.

filterSampleRelatedness

  flowchart TD
    filterSampleRelatedness[[filterSampleRelatedness:\nremove a given list of samples\nbased on IBD results]]

Function

To filter out all but unrelated samples, given the list of samples to keep from its predecessor rules.

Command

bcftools view {input.vcf} -S {input.samples} -O z -o {output}

Parameters

-S {input.vcf}: A file containing a list of Sample IDs to include.
-O z: Output format (-Oz denotes a BG-Zipped VCF output)
-o {output}: Output file.

filterLocations

  flowchart TD
    filterLocations[[filterLocations:\nTrim the dataset to one of\nthe studied regions]]

Function

To filter out all but unrelated samples, given the list of samples to keep from its predecessor rules.

Command

plink2 --allow-extra-chr --vcf {input} --from-bp {params.fromBP} --to-bp {params.toBP} --chr {params.chr} --output-chr chr26 --export vcf-4.2 bgz --out results/TRIM/ALL_{wildcards.location}

Parameters

--allow-extra-chr: Permits non-standard chromosome codes in input data.
--vcf {input.vcf}: File path to the input VCF file via parameter.
--from-bp {params.fromBP}: Start coordinates for the region to trim to.
--to-bp {params.toBP}: Stop coordinates for the region to trim to.
--chr {params.chr}: The chromosome to trim on.
--output-chr chr26: Sets chromosome code notation in output files to include the 'chr' as a prefix.
--export vcf-4.2 bgz: Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out results/TRIM/ALL_{wildcards.location}: Provide the file name and path for output creation.

writeSampleMetadata

  flowchart TD
    writeSampleMetadata[[writeSampleMetadata:\nTranspile cluster ownership from\nsample cluster assignment into\ninput format]]

Function

To compile sample metadata.

Command

join("scripts", "01-TRANSPILE_CLUSTERS.py")

Parameters

input/samples.csv: The sample information provided that should be used in this analysis.

reportFreq

  flowchart TD
    reportFreq[[reportFreq:\nPerform frequency analysis]]

Function

To generate a frequency report.

Command

plink2 --allow-extra-chr --vcf {input.vcf} --freq counts --export vcf-4.2 bgz --out results/FINAL/$CLUSTER/{params.prefix}

, and

plink2 --allow-extra-chr --vcf {input.vcf} --pheno iid-only results/REFERENCE/cluster_$CLUSTER.txt --loop-cats $CLUSTER --freq counts --missing --hardy midp --out results/FINAL/$CLUSTER/{params.prefix}

Parameters

--allow-extra-chr: Permits non-standard chromosome codes in input data.
--vcf {input.vcf}: File path to the input VCF file via parameter.
--freq counts: Generate a frequency report including variant count data.
--export vcf-4.2 bgz: Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out results/TRIM/ALL_{wildcards.location}: Provide the file name and path for output creation.
--pheno iid-only results/REFERENCE/cluster_$CLUSTER.txt: Extract phenotype information from the given file by Sample ID.
--loop-cats $CLUSTER: Re-run this command and focus on the populated cluster (This command is designed to be executed in a bash for loop, where $CLUSTER is set each iteration).
--missing: Generate a missingness report for both samples and variants.
--hardy midp: Generate a Hardy-Weinburg report with Mid-P adjustments.

Population Structure Workflow Rules

plinkPed

  flowchart TD
    plinkPed[[plinkPed:\nConvert to PLINK-1.9's PED\n format]]

fetchPedLables

  flowchart TD
    fetchPedLables[[fetchPedLables:\nGenerate Ind2Pop sample annotations\n file]]

Admixture

  flowchart TD
    Admixture[[Admixture:\nPerform an admixture analysis]]

plinkPca

  flowchart TD
    plinkPca[[Plink_PCA:\nPerform a PLINK-2.0 PCA]]