Data

A summary of the data and input files required to perform an analysis.

Table of contents
  1. Overview
  2. Input Data
    1. Compression and Indexing
  3. Analysis configuration
    1. Metadata
      1. datasets.csv Metadata

This page lists the information needed to run the VCF Validation Workflow. Below we guide users through the system used to declare an analysis manifest and all associated metadata files. For more information, consult the relevant section below, which contains more specific guidance, discussion and technical documentation.

Overview

To perform an analysis with this workflow, users will need to configure the workflow. This includes providing environment-related information like output locations, as well as analysis settings like reference population selection. This information is all declared and stored using the config/manifest.json file.

Input Data Infographic
---
title: Input filemap
config:
    flowchart:
        defaultRenderer: elk
    elk:
        nodePlacementStrategy: BRANDES_KOEPF
---
flowchart TB
  subgraph input [Input Files]
      subgraph data [Datasets]
          datasetFile1
          datasetFile2
          datasetFileN
      end

      subgraph metadata [Analysis Metadata]
          datasetMeta
      end
  end
  subgraph config [Configuration]
    configuration
  end

  vcf_validation_workflow[\VCF Validation Workflow/]
  click vcf_validation_workflow href "https://tuks-icmm.github.io/VCF-Validation-Workflow/workflow/methodology" _blank

  metadata -.-o|Describes| data

  config --> vcf_validation_workflow
  input --> vcf_validation_workflow


  vcf_validation_workflow --> results

  results(((Results)))

Input Data

This workflow is designed to work on Variant Call Format files (.vcf file extension). Refer to the latest version of the VCF specification for details of the format.

Compression and Indexing

This workflow accepts uncompressed VCF files, but it will compress and index the data during handling for performance reasons. If possible, please provide your files in compressed and indexed form.
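As a rough pre-flight check, the sketch below reports whether a VCF file already looks compressed and indexed. The helper name `describe_vcf` is hypothetical and is not part of the workflow. It assumes that an index, if present, sits next to the file with a `.tbi` or `.csi` suffix, and it only inspects the gzip magic bytes, so it cannot tell plain gzip apart from block-compressed (BGZF) output.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip (and therefore BGZF) stream


def describe_vcf(path):
    """Report whether a VCF looks compressed and whether an index sits next to it.

    Hypothetical helper for illustration only; the workflow does not call it.
    """
    vcf = Path(path)
    with vcf.open("rb") as handle:
        compressed = handle.read(2) == GZIP_MAGIC
    # Assumption: indexes follow the usual <file>.tbi / <file>.csi naming.
    indexed = any(Path(str(vcf) + suffix).exists() for suffix in (".tbi", ".csi"))
    return {"compressed": compressed, "indexed": indexed}
```

A file reported as uncompressed or unindexed is still accepted; the workflow will simply spend extra time preparing it.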

Analysis configuration

All configuration for an analysis, both environment-related information and analysis settings, is declared in the config/manifest.json file described in this section.

The manifest.json file

This file is responsible for declaring all information relating to the analysis and serves as the central point of contact between the workflow runtime and your input data. It is also used to configure and synchronize any sub-workflows imported internally.

manifest.json format example
input <Object>
  datasets <Array<str>>
    A list of path segments giving the file path to the dataset metadata file. Should be suitable for use with the Python os.path.join() function.
  locations <Array<str>>
    A list of path segments giving the file path to the location metadata file. Should be suitable for use with the Python os.path.join() function.
  samples <Array<str>>
    A list of path segments giving the file path to the samples metadata file. Should be suitable for use with the Python os.path.join() function.
  transcripts <Array<str>>
    A list of path segments giving the file path to the transcript metadata file. Should be suitable for use with the Python os.path.join() function.
output <Array<str>>
  A list of path segments giving the path to a folder where the results of the analysis should be stored. If the folder does not exist, it will be created.
resources <Object>
  reference_genomes <Array<Object>>
    A list of objects, where each object describes a reference genome available for use, using the following properties:
    name <str>
      The name of the reference genome. Should correspond to the value used in the dataset metadata file.
    location <Array<str>>
      A list of path segments giving the file path to the reference genome. Should be provided in FASTA format.
parameters <Object>
  fishers-test <Object>
    cluster_name* <str>
      The name of the cluster level declared in your sample metadata file for which you would like to declare a reference population. This population will be used for pairwise testing against each of the remaining populations in that column.
{
    "input": {
        "datasets": [
            "/",
            "path",
            "to",
            "my",
            "dataset",
            "metadata"
        ]
    },
    "output": [
        "/",
        "path",
        "to",
        "my",
        "output",
        "location"
    ],
    "resources": {
        "reference_genomes": [
            {
                "name": "",
                "location": [
                    "/",
                    "path",
                    "to",
                    "my",
                    "reference",
                    "genome"
                ]
            }
        ]
    }
}
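To illustrate how the path-segment lists above relate to os.path.join(), here is a minimal sketch that joins every list in a parsed manifest into a single path string. The function name `resolve_manifest_paths` and the reference genome name `GRCh38` in the sample are assumptions made for illustration; the workflow's own loading code may differ.

```python
import json
import os


def resolve_manifest_paths(manifest):
    """Join each path-segment list in a parsed manifest into one path string.

    Hypothetical helper; it only covers the keys shown in the example above.
    """
    genomes = manifest.get("resources", {}).get("reference_genomes", [])
    return {
        "input": {
            key: os.path.join(*segments)
            for key, segments in manifest.get("input", {}).items()
        },
        "output": os.path.join(*manifest["output"]),
        "reference_genomes": {
            genome["name"]: os.path.join(*genome["location"]) for genome in genomes
        },
    }


# Same shape as the example above; "GRCh38" is an illustrative name only.
example = json.loads("""
{
    "input": {"datasets": ["/", "path", "to", "my", "dataset", "metadata"]},
    "output": ["/", "path", "to", "my", "output", "location"],
    "resources": {
        "reference_genomes": [
            {"name": "GRCh38",
             "location": ["/", "path", "to", "my", "reference", "genome"]}
        ]
    }
}
""")
paths = resolve_manifest_paths(example)
```

Storing paths as segment lists keeps the manifest free of platform-specific separators, since os.path.join() inserts the right one at load time.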

Metadata

All data and sample metadata is provided in the form of .csv files declared in the manifest.json file. These files allow you to declare datasets and provide the necessary information to determine which contig-level files should be used for analysis given the provided genomic coordinates. For convenience, this explanation assumes standard names for these files, such as datasets.csv.

This design pattern of declaring metadata files via the manifest.json was chosen specifically to allow users to create and store analysis configurations and metadata alongside the dataset files, which often have special storage requirements (e.g. space, access, etc.). This approach allows centralized dataset management and only requires that the manifest.json file is discoverable under the path config/manifest.json. That can be accomplished with a symlink or shortcut, keeping the amount of setup work to a minimum.
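The symlink approach can be sketched as follows. The helper `link_manifest` and both of its arguments are hypothetical; it assumes a platform and filesystem where the current user may create symbolic links (on Windows this can require extra privileges).

```python
from pathlib import Path


def link_manifest(stored_manifest, workflow_dir):
    """Expose a manifest stored elsewhere as <workflow_dir>/config/manifest.json.

    Hypothetical helper for illustration; adjust both paths to your own setup.
    """
    target = Path(stored_manifest).resolve()
    link = Path(workflow_dir) / "config" / "manifest.json"
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        link.unlink()  # replace a link left over from a previous analysis
    elif link.exists():
        # Refuse to overwrite a real file so no configuration is lost.
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.symlink_to(target)
    return link
```

Switching to a different analysis then only means pointing the link at another stored manifest.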

datasets.csv Metadata

The dataset metadata file allows you to declare information about your datasets to analyze, including the reference genome version and where to locate the files.

Please provide data in the form of multiple *.vcf files split per-contig.

Format example
dataset_name <str>
The name of the dataset. This value is used as a universal accessor for the dataset and any information relating to it, so output files use it to determine things like filenames. It is also used to connect other metadata to this dataset computationally, e.g. sample-level information.
E.g. 1000G
reference_genome <str>
An enum indicating which reference genome version this dataset has been called on.
E.g. GRCh37 or GRCh38
file <file_path>
A file path indicating the location of the dataset to be used in the analysis.
E.g. /nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz

| **dataset_name** | **reference_genome** | **file** |
| :--------------- | :------------------- | :---------------------------------------------------------- |
| HG002 | GRCh38 | `/nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz` |
| HG002 | GRCh38 | `/nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz` |
| HG002 | GRCh38 | `/nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz` |
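As a rough illustration of how the three columns fit together, the sketch below parses a datasets.csv and checks each reference_genome value against the names declared under reference_genomes in the manifest. The function `read_datasets` and its `known_genomes` argument are assumptions for this example and do not describe the workflow's internal validation.

```python
import csv
import io

REQUIRED_COLUMNS = ("dataset_name", "reference_genome", "file")


def read_datasets(handle, known_genomes):
    """Parse dataset metadata rows and check them against declared genomes.

    Hypothetical helper; known_genomes is assumed to hold the reference
    genome names declared in manifest.json.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"datasets.csv is missing columns: {missing}")
    rows = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        genome = row["reference_genome"]
        if genome not in known_genomes:
            raise ValueError(f"line {line_number}: unknown reference genome {genome!r}")
        rows.append(row)
    return rows


example = io.StringIO(
    "dataset_name,reference_genome,file\n"
    "HG002,GRCh38,/nlustre/users/graeme/PUBLIC/GenomeInABottle/HG002.vcf.gz\n"
)
datasets = read_datasets(example, known_genomes={"GRCh38"})
```

Checking the genome names early surfaces mismatches between datasets.csv and manifest.json before any analysis starts.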


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This project is managed by the Institute for Cellular and Molecular Medicine at the University of Pretoria.