Data Requirements

A summary of the required data and input files needed to perform an analysis.

Table of contents

Overview
Input Data
1. Compression and Indexing
Analysis configuration
1. Metadata
  1. locations.csv Metadata
  2. samples.csv Metadata

This page describes the information needed to run the Population Structure Workflow. It explains how to declare an analysis manifest and the metadata files associated with it. Each section below contains more specific guidance, discussion and technical documentation.

Overview

This workflow uses an analysis manifest to encapsulate all analysis variables. The manifest file collects and connects the metadata for your samples, datasets, and relevant reference resources (reference genomes, etc.). This allows the workflow to programmatically access clusters through sample annotations, which is required to produce cluster-level reports.

Input Data Infographic

---
title: Input Filemap
config:
    flowchart:
        defaultRenderer: elk
    elk:
        nodePlacementStrategy: BRANDES_KOEPF
---
flowchart TB

  subgraph input ["Input Files"]
      subgraph data [Datasets]
          datasetFile1{{<b>Dataset file</b><br><code>GnomAD_Chr1.vcf.gz</code>}}
          datasetFile2{{<b>Dataset file</b><br><code>GnomAD_Chr2.vcf.gz</code>}}
          datasetFileN{{<b>Dataset file</b><br><code>GnomAD_ChrN...vcf.gz</code>}}
      end
      subgraph metadata ["Analysis metadata"]
          locationMeta{{<b>Coordinates for study</b><br><code>locations.csv</code>}}
          sampleMeta{{<b>Sample metadata</b><br><code>samples.csv</code>}}
      end
  end
  subgraph config [<code>config/</code>]
    configuration{{<b>Analysis configuration</b> <br><code>config/configuration.json</code>}}
  end
  population_structure_workflow[\Population structure Workflow/]
  click population_structure_workflow href "https://tuks-icmm.github.io/Population-Structure-Workflow/workflow/methodology" _blank

  
  input --> population_structure_workflow
  config ----> population_structure_workflow
  population_structure_workflow --> results

  results(((Results)))
    
    metadata -.-o|Describes| data

Input Data

This workflow is designed to work on variant call format (VCF) files, which use the .vcf file extension. Refer to the latest version of the VCF specification for details of the format.

Compression and Indexing

This workflow accepts uncompressed VCF files, but it will compress and index them during handling for performance reasons. If possible, please provide your files already compressed and indexed.
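Before starting a run, it can be useful to check which of your input files are already compressed and indexed. The following is a minimal sketch of such a check, not the workflow's own logic. The helper names are hypothetical, and it assumes that an index is stored next to the data file with a `.tbi` or `.csi` suffix appended to the file name.

```python
from pathlib import Path

# First two bytes of any gzip stream. BGZF (bgzip) output is gzip-compatible,
# so this check alone cannot tell plain gzip apart from bgzip.
GZIP_MAGIC = b"\x1f\x8b"

# Assumption: index files sit beside the VCF with one of these suffixes appended.
INDEX_SUFFIXES = (".tbi", ".csi")


def is_gzip_compressed(path: Path) -> bool:
    """Return True if the file starts with the gzip magic bytes."""
    with path.open("rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def has_index(path: Path) -> bool:
    """Return True if an index file exists alongside the given file."""
    return any(Path(str(path) + suffix).exists() for suffix in INDEX_SUFFIXES)


def describe_vcf(path: Path) -> str:
    """Summarise the state of a VCF file (hypothetical helper)."""
    if not is_gzip_compressed(path):
        return "uncompressed"
    return "compressed+indexed" if has_index(path) else "compressed"
```

Files reported as `uncompressed` or `compressed` are the ones the workflow would still need to process during handling.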

Analysis configuration

To perform an analysis with this workflow, users will need to configure the workflow. This includes providing environment-related information like output locations, as well as analysis settings like reference population selection. This information is all declared and stored using the config/manifest.json file.

The `manifest.json` file

This file is responsible for declaring all information relating to the analysis and serves as the central point of contact between the workflow runtime and your input data. It is also used to configure and synchronize any sub-workflows imported internally.

manifest.json format example

input <object>

locations <Array<str>>: A list representing the file-path to the location metadata file. Should be suitable for use with the python os.path.join() function.
samples <Array<str>>: A list representing the file-path to the samples metadata file. Should be suitable for use with the python os.path.join() function.

output <Array<str>>

A list representing a path to a folder where the results of the analysis should be stored. If the folder does not exist, it will be created.

{
    "input": {
        "locations": [
            "/",
            "path",
            "to",
            "my",
            "locations",
            "metadata"
        ],
        "samples": [
            "/",
            "path",
            "to",
            "my",
            "samples",
            "metadata"
        ]
    },
    "output": [
        "/",
        "path",
        "to",
        "my",
        "output",
        "location"
    ]
}
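Each path in the manifest is stored as a list of components rather than a single string. The sketch below shows how such lists could be turned back into paths with `os.path.join()`. The function name `resolve_manifest_paths` is hypothetical and the manifest text simply mirrors the example above; the workflow's actual loading code may differ.

```python
import json
import os

# Manifest content mirroring the format example above (placeholder paths).
MANIFEST_TEXT = """
{
    "input": {
        "locations": ["/", "path", "to", "my", "locations", "metadata"],
        "samples": ["/", "path", "to", "my", "samples", "metadata"]
    },
    "output": ["/", "path", "to", "my", "output", "location"]
}
"""


def resolve_manifest_paths(manifest: dict) -> dict:
    """Join each list of path components into a single path string."""
    return {
        "locations": os.path.join(*manifest["input"]["locations"]),
        "samples": os.path.join(*manifest["input"]["samples"]),
        "output": os.path.join(*manifest["output"]),
    }


manifest = json.loads(MANIFEST_TEXT)
paths = resolve_manifest_paths(manifest)
print(paths["locations"])
```

Storing the components separately leaves the choice of path separator to `os.path.join()` on the machine where the workflow runs.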

Metadata

All data and sample metadata is provided in the form of `.csv` files declared in the `manifest.json` file. These files allow you to declare datasets and provide the necessary information to determine which contig-level files should be used for analysis given the provided genomic coordinates. For convenience, this explanation assumes the standard names `locations.csv` and `samples.csv`.

This design pattern of declaring metadata files via the manifest.json file was chosen so that users can create and store analysis configurations and metadata alongside their data, which often has special storage requirements (e.g. space, access, etc.). All other analysis-specific files are declared in, and reachable through, the manifest.json file. The only requirement is that the manifest.json file is discoverable under the path config/manifest.json, which can be accomplished with a symlink or shortcut, keeping setup work to a minimum.

`locations.csv` Metadata

The location metadata file allows you to declare the genomic regions (coordinate windows) to be analysed and provide the necessary region-level information for use in this pipeline.

Format example

location_name <str>: The ID of a gene or, if not a studied gene region, a unique identifier to reference this genomic coordinate window.
E.g. CYP2A6
chromosome <enum <int [0-24]> >: The chromosome number on which the above genomic region can be found.
E.g. 19
start <int>: The start coordinates for the genomic window.
E.g. 40842850
stop <int>: The stop coordinates for the genomic window.
E.g. 40851138
strand <enum [-1,1]>: The strand on which the genomic region can be found, where 1 denotes the forward strand and -1 denotes the reverse strand.
E.g. -1

location_name	chromosome	start	stop	strand
CYP2A6	19	40842850	40851138	-1
CYP2B6	19	40988570	41021110	1
UGT2B7	4	69045214	69112987	1
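The field rules above can be checked before a run. The following is a minimal sketch under stated assumptions, not the workflow's own validation: `parse_locations` is a hypothetical helper, the file is assumed to be comma-delimited as the `.csv` extension suggests, and the requirement that `start` does not exceed `stop` is an assumption not stated in the field descriptions.

```python
import csv
import io

# Example rows from the table above, written as comma-delimited text (assumption).
EXAMPLE_CSV = """location_name,chromosome,start,stop,strand
CYP2A6,19,40842850,40851138,-1
CYP2B6,19,40988570,41021110,1
UGT2B7,4,69045214,69112987,1
"""


def parse_locations(text: str) -> list:
    """Parse location rows and check each field against the documented types."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        record = {
            "location_name": row["location_name"],
            "chromosome": int(row["chromosome"]),
            "start": int(row["start"]),
            "stop": int(row["stop"]),
            "strand": int(row["strand"]),
        }
        name = record["location_name"]
        if not 0 <= record["chromosome"] <= 24:
            raise ValueError(f"{name}: chromosome must be in the range 0-24")
        if record["strand"] not in (-1, 1):
            raise ValueError(f"{name}: strand must be -1 or 1")
        # Assumption: a window's start should not exceed its stop.
        if record["start"] > record["stop"]:
            raise ValueError(f"{name}: start is greater than stop")
        records.append(record)
    return records


locations = parse_locations(EXAMPLE_CSV)
```

A row with a non-numeric coordinate fails at the `int()` conversion, which surfaces the problem before any analysis starts.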

`samples.csv` Metadata

The sample metadata file allows you to declare samples and provide the necessary sample-level information for use in this pipeline.