Methodology
A breakdown of the methodology and softwares used in this workflow.
plink2
plink
bcftools
java -jar $PICCARD
Table of contents
Rule Map/Diagram
---
title: Pharmacogenetics Analysis
---
flowchart TD
subgraph pharmacogeneticsWorkflow [Pharmacogenetics Workflow]
direction BT
reportFreq[[reportFreq:\nPerform frequency analysis]]
filterRequestedSamples[[filterRequestedSamples:\nSubset samples to labeled\nsamples in metadata files]]
filterVariantMissingness[[filterVariantMissingness:\nFilter variants with 100%\nmissingness]]
filterSampleMissingness[[filterSampleMissingness:\nFilter samples with 100%\nmissingness]]
refFromFasta[[refFromfasta:\nCheck reference alleles against\nprovided reference genome]]
chrFilter[[chrFilter:\nFilter out non-standard\nchromosomes]]
writeSampleMetadata[[writeSampleMetadata:\nTranspile cluster ownership from\nsample cluster assignment into\ninput format]]
calculateLinkageDisequilibrium[[calculateLinkageDisequilibrium:\nCalculate LD associations]]
filterLinkageDisequilibrium[[filterLinkageDisequilibrium:\nRemove variants in LD]]
calculateIdentityByDescent[[calculateIdentityByDescent:\nCalculate Identity-By-Descent]]
calculateSampleIds[[calculateSampleIds:\nQuery a list of sample IDs\nfrom the input VCF]]
filterSampleRelatedness[[filterSampleRelatedness:\nremove a given list of]]
filterLocations[[filterLocations:\nTrim the dataset to one of\nthe studied regions]]
subgraph multipleVcfProtocol [Multiple dataset protocol]
direction LR
multipleVcfProtocolStart(((Start)))
ifMergeRequired{Is a \nmerge needed?}
mergeDatasets[[mergeDatasets:\nMerge multiple incoming\ndatasets]]
multipleVcfProtocolEnd(((End)))
multipleVcfProtocolStart --> ifMergeRequired
ifMergeRequired --> |yes| mergeDatasets --> multipleVcfProtocolEnd
ifMergeRequired --> |No| multipleVcfProtocolEnd
end
multipleVcfProtocol --> refFromFasta --> chrFilter --> filterRequestedSamples --> filterVariantMissingness --> filterSampleMissingness --> calculateLinkageDisequilibrium
filterSampleMissingness & calculateLinkageDisequilibrium --> filterLinkageDisequilibrium
filterLinkageDisequilibrium --> calculateIdentityByDescent --> calculateSampleIds
filterLinkageDisequilibrium & calculateSampleIds --> filterSampleRelatedness
filterSampleRelatedness --> filterLocations --> reportFreq
writeSampleMetadata --> reportFreq
end
subgraph ValidateVcfWorkflow [Validate VCF Workflow]
wipeInfo[[wipeInfo:\nRemove INFO column for\ncomputational processing\n efficiency]]
normalize[[normalize:\nNormalize all SNPs]]
sort[[sort:\nEnsure correct variant order]]
filter[[filter:\nRemove all variants\nexcept SNPs]]
annotate[[annotate:\nAnnotate VCF against given\nreference VCF such as \n dbSNP, and rename any\nunknown variants.]]
subgraph liftoverProtocol [Liftover]
direction LR
liftoverProtocolStart(((Start)))
liftover[[liftover:\nPerform reference genome\nliftover]]
liftoverProtocolEnd(((End)))
ifLiftoverRequired{Is a\nliftover\nrequired?}
liftoverProtocolStart --> ifLiftoverRequired
ifLiftoverRequired --> |yes| liftover --> liftoverProtocolEnd
ifLiftoverRequired --> |no| liftoverProtocolEnd
end
wipeInfo --> normalize --> sort --> filter --> annotate --> liftoverProtocol
end
subgraph PopulationStructureWorkflow [Population Structure Workflow]
plinkPca[[Plink_PCA:\nPerform a PLINK-2.0 PCA]]
plinkPed[[plinkPed:\nConvert to PLINK-1.9's PED\n format]]
fetchPedLables[[fetchPedLables:\nGenerate Ind2Pop sample annotations\n file]]
Admixture[[Admixture:\nPerform an admixture analysis]]
plinkPed --> fetchPedLables --> Admixture
filterSampleRelatedness --> plinkPca & plinkPed
end
liftoverProtocol --> multipleVcfProtocol
END((Results))
Admixture --> END
plinkPca --> END
reportFreq --> END
calculateIdentityByDescent --> END
calculateLinkageDisequilibrium --> END
VCF Validation Workflow Rules
wipeInfo
flowchart TD
wipeInfo[[wipeInfo:\nRemove INFO column for\ncomputational processing\n efficiency]]
- Function
- To remove the
INFO
andFORMAT
columns on the incoming dataset. This is done to speed up computation time for downstream analysis. - Command
bcftools annotate -x INFO,FORMAT -Oz -o {output.vcf} {input.vcf}
- Parameters
-
-x INFO,FORMAT
- Remove the
INFO
andFORMAT
annotations from the input VCF file. -Oz
- Output format (
-Oz
denotes a BG-Zipped VCF output) -o {output.vcf}
- Output file.
normalize
flowchart TD
normalize[[normalize:\nNormalize all SNPs]]
- Function
- To normalize variant representations within the dataset provided. This involves the following:
- decomposing multi-allelic records
- left-aligning all variants
- right-handed trimming to ensure parsimony
- Command
bcftools norm -m -any -O z -o {output.vcf} < {input.vcf}
- Parameters
-
-m -any
- Decompose multi-allelic entries to bi-allelic entries (
-
) and merge both SNPs and INDELS into single records (any
) -Oz
- Output format (
-Oz
denotes a BG-Zipped VCF output) -o {output.vcf}
- Output file.
sort
flowchart TD
sort[[sort:\nEnsure correct variant order]]
This rule is responsible for sorting variants according to position, relative to the provided reference genome. This is important for downstream analysis which assumes ordered variants.
- Function
- To sort variants according to position, relative to the provided reference genome. This is important for downstream analysis which assumes ordered variants.
- Command
bcftools sort -m {params.memory} -T results/PREP/{wildcards.dataset_name} -O z -o {output.vcf} < {input.vcf}
- Parameters
-
-m {params.memory}
- Provide a RAM memory available to the
bcftools sort
command. -T results/PREP/{wildcards.dataset_name}
- Provide a RAM memory available to the
bcftools sort
command. -Oz
- Output format (
-Oz
denotes a BG-Zipped VCF output) -o {output.vcf}
- Output file.
filter
flowchart TD
filter[[filter:\nRemove all variant types\nexcept SNPs]]
- Function
- To remove all variant types except SNPs
- Command
bcftools view -v snps -f PASS -O z -o {output.vcf} < {input.vcf}
- Parameters
-
-v snps
- Only include SNPs
-f PASS
- Only select variants with
PASS
values. -Oz
- Output format (
-Oz
denotes a BG-Zipped VCF output) -o {output.vcf}
- Output file.
annotate
flowchart TD
annotate[[annotate:\nAnnotate VCF against given\nreference VCF such as \n dbSNP, and rename any unknown\nvariants.]]
- Function
- To annotate the incoming data with variant IDs from the provided
resources/annotations.vcf.gz
, and rename any unknown variants. - Command
bcftools annotate -c ID -a {input.annotations} -O z -o {output.vcf} {input.vcf}
- Parameters
-
-a {input.annotations}
- The VCF file that contains the desired annotations.
-c ID
- Copy the
ID
column from the provided annotation VCF. -I +'%CHROM:%POS|%REF-%FIRST_ALT'
- Name all variants using the provided formula. The
+
indicates that this renaming logic should only be applied to variants which have no name, and is applied after retrieving annotations from the provided annotations VCF.%CHROM
denotes the chromosome,%POS
denotes the base-pair position of this variant,%REF
denotes the reference allele at this location, and%FIRST_ALT
denotes the first allele. Since this VCF file has been normalized and multi-allelic variants have already been decomposed to bi-allelic records, this will correspond to the only available allele for a loci. -O z
- Output format (
-Oz
denotes a BG-Zipped VCF output) -o {output.vcf}
- Output file.
annotateUnknown
flowchart TD
annotateUnknown[[annotateUnknown:\nName all un-annotated variants using \nstandardized naming conventions.]]
- Function
- To name all un-named variants which did not have a matching annotation ID.
- Command
plink --vcf {input.vcf} --set-missing-var-ids @:#\$1-\$2 --new-id-max-allele-len 200 --out {params.output}
- Parameters
-
--vcf {input.vcf}
- File path to the input VCF file via parameter.
--set-missing-var-ids @:#\$1-\$2
- A string which describes a naming scheme to be used when setting all un-named variants
@
denotes the chromosome code,#
denotes the base-pair coordinates,$1
denotes the reference allele and$2
denotes the alternate allele. --new-id-max-allele-len 200
- Sets a maximum allowed variant ID length.
--out {params.output}
- Provide the file name and path for output creation.
liftover
*
flowchart TD
subgraph liftoverProtocol [Liftover]
direction LR
liftoverProtocolStart(((Start)))
liftover[[liftover:\nPerform reference genome\nliftover]]
liftoverProtocolEnd(((End)))
ifLiftoverRequired{Is a\nliftover\nrequired?}
liftoverProtocolStart --> ifLiftoverRequired
ifLiftoverRequired --> |yes| liftover --> liftoverProtocolEnd
ifLiftoverRequired --> |no| liftoverProtocolEnd
end
- Function
- To perform reference-genome version liftovers.
- Command
java -jar $PICARD LiftoverVcf I={input.vcf} O={output.vcf} R={params.ref} C={params.chainFile} REJECT={output.rejected}
- Parameters
-
I={input.vcf}
- Provide the input VCF via parameter.
O={output.vcf}
- Provide the output VCF to be written to via parameter.
R={params.ref}
- Provide the reference genome to be used during LiftOver
C={params.chainFile}
- Provide the chain-file describing the nature of the changes between two reference genome versions.
REJECT={params.chainFile}
- Creates a file containing records which could not be lifted over.
Pharmacogenetics Workflow Rules
mergeDatasets
*
flowchart TD
subgraph multipleVcfProtocol [Multiple dataset protocol]
direction LR
multipleVcfProtocolStart(((Start)))
ifMergeRequired{Is a \nmerge needed?}
mergeDatasets[[mergeDatasets:\nMerge multiple incoming\ndatasets]]
multipleVcfProtocolEnd(((End)))
multipleVcfProtocolStart --> ifMergeRequired
ifMergeRequired --> |yes| mergeDatasets --> multipleVcfProtocolEnd
ifMergeRequired --> |No| multipleVcfProtocolEnd
end
This rule only executes when multiple described datasets are detected. This rule is responsible for merging multiple datasets into a single VCF file, suitable for collective analysis.
- Function
- To perform reference-genome version liftovers.
- Command
bcftools merge -O z -o {output} {input.vcf}
- Parameters
-
-O z
- Output format (
-Oz
denotes a BG-Zipped VCF output) -o {output.vcf}
- Output file.
refFromFasta
flowchart TD
refFromFasta[[refFromfasta:\nCheck reference alleles against\nprovided reference genome]]
- Function
- To check each loci and comparing its listed reference to that provided in the reference genome.
- Command
plink2 --vcf {input.vcf} --fa {params.ref} --ref-from-fa force --allow-extra-chr --export vcf-4.2 bgz --out results/COLLATE/refFromFasta
- Parameters
-
--vcf {input.vcf}
- File path to the input VCF file via parameter.
--fa {params.ref}
- File path to reference genome to be used for comparison.
--ref-from-fa force
- Sets REF allele to provided reference FASTA when possible unambiguously (Does not apply to some INDELS)
--allow-extra-chr
- Permits non-standard chromosome codes in input data
--export vcf-4.2 bgz
- Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out results/COLLATE/refFromFasta
- Provide the file name and path for output creation.
chrFilter
flowchart TD
chrFilter[[chrFilter:\nFilter out non-standard\nchromosomes]]
- Function
- To filter out non-standard chromosomes.
- Command
plink2 --vcf {input.vcf} --allow-extra-chr --output-chr chr26 --chr 1-26 --export vcf-4.2 bgz --out results/COLLATE/chrFilter
- Parameters
-
--vcf {input.vcf}
- File path to the input VCF file via parameter.
--allow-extra-chr
- Permits non-standard chromosome codes in input data.
--output-chr chr26
- Sets chromosome code notation in output files to include the 'chr' as a prefix.
--chr 1-26
- Request a subset of chromosomes to be included in the output file.
--export vcf-4.2 bgz
- Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out results/COLLATE/chrFilter
- Provide the file name and path for output creation.
filterRequestedSamples
flowchart TD
filterRequestedSamples[[filterRequestedSamples:\nSubset samples to labeled\nsamples in metadata files]]
- Function
- To remove unneeded samples. This is done by comparison against all provided sample annotations in the
input/samples.csv
metadata file. - Command
bcftools view -s {params.samples} -O z -o {output} {input.vcf}
- Parameters
-
-s {params.samples}
- Provide a list of sample IDs to include in output.
-O z
- Output format (
-Oz
denotes a BG-Zipped VCF output) -o {output}
- Output file.
filterVariantMissingness
flowchart TD
filterVariantMissingness[[filterVariantMissingness:\nFilter variants with 100%\nmissingness]]
- Function
- To manage and remove regions of missing calls along the variant-level.
- Command
plink2 --chr 1-26 --allow-extra-chr --vcf {input} --geno 1.0 --output-chr chr26 --export vcf-4.2 bgz --out {params.output}
- Parameters
-
--chr 1-26
- Request a subset of chromosomes to be included in the output file.
--allow-extra-chr
- Permits non-standard chromosome codes in input data.
--vcf {input.vcf}
- Removes all variants with a missing call rate exceeding
1.0
--geno 1.0
- File path to the input VCF file via parameter.
--output-chr chr26
- Sets chromosome code notation in output files to include the 'chr' as a prefix.
--export vcf-4.2 bgz
- Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out {params.output}
- Provide the file name and path for output creation.
filterSampleMissingness
flowchart TD
filterSampleMissingness[[filterSampleMissingness:\nFilter samples with 100%\nmissingness]]
- Function
- To manage and remove regions of missing calls along the sample-level.
- Command
plink2 --chr 1-26 --allow-extra-chr --vcf {input} --mind 1.0 --output-chr chr26 --export vcf-4.2 bgz --out {params.output}
- Parameters
-
--chr 1-26
- Request a subset of chromosomes to be included in the output file.
--allow-extra-chr
- Permits non-standard chromosome codes in input data.
--vcf {input.vcf}
- File path to the input VCF file via parameter.
--mind 1.0
- Removes all samples with a missing call rate exceeding
1.0
--output-chr chr26
- Sets chromosome code notation in output files to include the 'chr' as a prefix.
--export vcf-4.2 bgz
- Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out {params.output}
- Provide the file name and path for output creation.
calculateLinkageDisequilibrium
flowchart TD
calculateLinkageDisequilibrium[[calculateLinkageDisequilibrium:\nCalculate LD associations]]
- Function
- To calculate and compile a Linkage-Disequilibrium report.
- Command
plink2 --vcf {input} --chr 1-26 --new-id-max-allele-len 1000 --rm-dup exclude-mismatch --indep-pairwise 50 5 0.5 --bad-ld --out {params.output}
- Parameters
-
--vcf {input.vcf}
- File path to the input VCF file via parameter.
--chr 1-26
- Request a subset of chromosomes to be included in the output file.
--new-id-max-allele-len 1000
- Sets a new internal maximum length for variant IDs.
--rm-dup exclude-mismatch
- When duplicate IDs are found, remove all entries.
--indep-pairwise 50 5 0.5
- Only include variants in approximate linkage equilibrium, using a
50
-variant window which moves5
variants per step and removes variants with an r2 value greater than0.5
. --bad-ld
- Overrides warning for less than 50 sample datasets where LD is not accurate.
--out {params.output}
- Provide the file name and path for output creation.
filterLinkageDisequilibrium
flowchart TD
filterLinkageDisequilibrium[[filterLinkageDisequilibrium:\nRemove variants in LD]]
- Function
- To calculate and compile a Linkage-Disequilibrium report.
- Command
plink2 --allow-extra-chr --vcf {input.vcf} --extract {input.inclusion_list} --export vcf-4.2 bgz --out {params.output}
- Parameters
-
--allow-extra-chr
- Permits non-standard chromosome codes in input data.
--vcf {input.vcf}
- File path to the input VCF file via parameter.
--extract {input.inclusion_list}
- Extracts only the listed samples.
--export vcf-4.2 bgz
- Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out {params.output}
- Provide the file name and path for output creation.
calculateIdentityByDescent
flowchart TD
calculateIdentityByDescent[[calculateIdentityByDescent:\nCalculate Identity-By-Descent]]
- Function
- To calculate and compile am Identity-By-Descent report.
- Command
plink --vcf {input} --allow-extra-chr --keep-allele-order --genome --min 0.2 --recode vcf-iid bgz --out {params.output}
- Parameters
-
--vcf {input.vcf}
- File path to the input VCF file via parameter.
--allow-extra-chr
- Permits non-standard chromosome codes in input data.
--keep-allele-order
- Tells Plink-1.9 to maintain original A1/A2, otherwise major allele is set to A2.
--genome
- Run an identity-by-descent report.
--min 0.2
- Removes individuals with an IBD score below
0.2
. --recode vcf-iid bgz
- Sets output format to a BG-ZIpped VCF with individual-identifiers.
--out {params.output}
- Provide the file name and path for output creation.
calculateSampleIds
flowchart TD
calculateSampleIds[[calculateSampleIds:\nQuery a list of sample IDs\nfrom the input VCF]]
- Function
- To generate a list of sample IDs.
- Command
bcftools query -l {input} > {output}
- Parameters
-
-l {input}
- A file containing a list of Sample IDs to keep.
filterSampleRelatedness
flowchart TD
filterSampleRelatedness[[filterSampleRelatedness:\nremove a given list of samples\nbased on IBD results]]
- Function
- To filter out all but unrelated samples, given the list of samples to keep from its predecessor rules.
- Command
bcftools view {input.vcf} -S {input.samples} -O z -o {output}
- Parameters
-
-S {input.vcf}
- A file containing a list of Sample IDs to include.
-O z
- Output format (
-Oz
denotes a BG-Zipped VCF output) -o {output}
- Output file.
filterLocations
flowchart TD
filterLocations[[filterLocations:\nTrim the dataset to one of\nthe studied regions]]
- Function
- To filter out all but unrelated samples, given the list of samples to keep from its predecessor rules.
- Command
plink2 --allow-extra-chr --vcf {input} --from-bp {params.fromBP} --to-bp {params.toBP} --chr {params.chr} --output-chr chr26 --export vcf-4.2 bgz --out results/TRIM/ALL_{wildcards.location}
- Parameters
-
--allow-extra-chr
- Permits non-standard chromosome codes in input data.
--vcf {input.vcf}
- File path to the input VCF file via parameter.
--from-bp {params.fromBP}
- Start coordinates for the region to trim to.
--to-bp {params.toBP}
- Stop coordinates for the region to trim to.
--chr {params.chr}
- The chromosome to trim on.
--output-chr chr26
- Sets chromosome code notation in output files to include the 'chr' as a prefix.
--export vcf-4.2 bgz
- Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out results/TRIM/ALL_{wildcards.location}
- Provide the file name and path for output creation.
writeSampleMetadata
flowchart TD
writeSampleMetadata[[writeSampleMetadata:\nTranspile cluster ownership from\nsample cluster assignment into\ninput format]]
- Function
- To compile sample metadata.
- Command
join("scripts", "01-TRANSPILE_CLUSTERS.py")
- Parameters
-
input/samples.csv
- The sample information provided that should be used in this analysis.
reportFreq
flowchart TD
reportFreq[[reportFreq:\nPerform frequency analysis]]
- Function
- To generate a frequency report.
- Command
plink2 --allow-extra-chr --vcf {input.vcf} --freq counts --export vcf-4.2 bgz --out results/FINAL/$CLUSTER/{params.prefix}
, andplink2 --allow-extra-chr --vcf {input.vcf} --pheno iid-only results/REFERENCE/cluster_$CLUSTER.txt --loop-cats $CLUSTER --freq counts --missing --hardy midp --out results/FINAL/$CLUSTER/{params.prefix}
- Parameters
-
--allow-extra-chr
- Permits non-standard chromosome codes in input data.
--vcf {input.vcf}
- File path to the input VCF file via parameter.
--freq counts
- Generate a frequency report including variant count data.
--export vcf-4.2 bgz
- Save output to a BG-Zipped VCF file using the VCF-4.2 specification.
--out results/TRIM/ALL_{wildcards.location}
- Provide the file name and path for output creation.
--pheno iid-only results/REFERENCE/cluster_$CLUSTER.txt
- Extract phenotype information from the given file by Sample ID.
--loop-cats $CLUSTER
- Re-run this command and focus on the populated cluster (This command is designed to be executed in a bash for loop, where $CLUSTER is set each iteration).
--missing
- Generate a missingness report for both samples and variants.
--hardy midp
- Generate a Hardy-Weinburg report with Mid-P adjustments.
Population Structure Workflow Rules
plinkPed
flowchart TD
plinkPed[[plinkPed:\nConvert to PLINK-1.9's PED\n format]]
fetchPedLables
flowchart TD
fetchPedLables[[fetchPedLables:\nGenerate Ind2Pop sample annotations\n file]]
Admixture
flowchart TD
Admixture[[Admixture:\nPerform an admixture analysis]]
plinkPca
flowchart TD
plinkPca[[Plink_PCA:\nPerform a PLINK-2.0 PCA]]