GA blog post
Concepts we focused on
- Reproducibility
- Scalability
- Portability
- Speed
- Standardisation
- Best practice
Pipelines
- human_genomics_pipeline: https://github.com/ESR-NZ/human_genomics_pipeline
- vcf_annotation_pipeline: https://github.com/ESR-NZ/vcf_annotation_pipeline
We have developed two pipelines that are designed to complement one another. The first pipeline (human_genomic_pipeline) processes paired-end sequencing data using bwa and GATK4, taking the data from fastq to vcf. Quality control checks are also undertaken. The second pipeline filters the raw variants and annotates the vcf files using GATK4, SnpSift, VEP, genmod and dbSNP. Lastly, the vcf file can be optionally prepared for ingestion into scout, a user friendly, open source software that allows clinicians to explore the variants (http://www.clinicalgenomics.se/scout/).
Features
- Open source - freely available in github
- Written in a workflow language (Snakemake)
- Deployable to a HPC or a single server
- Analyse exome or genome data
- Single sample or cohort/family analysis
- Run on CPU’s or GPU’s (parabricks)
- Species agnostic
- Regularly maintained
- Enthusiasm to support community contribution!
How the pipelines are scalable
- Snakemake is flexible when handling variable numbers of samples and resource availabilities - Snakemake is a bit like Tetris, it’s smart enough to know if and when more samples can be processed when they fit within the overall resource limits set by the user
How the pipelines are reproducible
- Documentation to provide clear instruction on how to deploy the pipelines
- Open source - pipelines are available to reviewers in the peer review process
- Previous versions of the pipelines available because of version tracking in github
- User configuration of the pipeline settings are physically separated from the core pipeline, reducing the chance of unintentional changes to the core pipeline
- Log files written for nearly all steps in the pipeline
- The final report includes an interactive plot outlining the complete pipeline workflow and each file that was input and output by each rule/step
How the pipelines are portable
- Minimal software requirements in order to run the pipelines - only conda which is commonly available on research servers and HPC’s (and a parabricks licence and GPU/s if running GPU enabled)
- There are discussions underway about obtaining a Parabricks licence for NeSi - in this case, GPU accelerated pipeline runs could be available to researchers and students New Zealand wide that have access to NeSi
- Open source and available on github - can be easily cloned to any machine with a single git command
- Available resources (such as the maximum number of cores) can be set by the user, allowing the pipelines to be deployed on machines or HPC’s with variable resource availabilities
How the pipelines are speedy
- The pipelines are integrated with NVIDIA Clara Parabricks (https://www.nvidia.com/en-us/docs/parabricks/quickstart-guide/software-overview/). This means that the user can achieve very significant speedups when analysing a few samples by setting a single flag in the configuration file (if the user has a parabricks licence and GPU/s)
- The pipelines are scalable - many samples can be processed at one time
How the pipelines encourage standardisation
- Open source - allows other labs in New Zealand to freely use the pipelines
- Designed to be species agnostic - we have developed these pipelines with human genomic data in mind. However, the pipelines have been designed to allow the user to analyse genomic data from other species by simply including their species specific reference genome and variant databases in the pipeline configuration file
- Available on github - the genomics/bioinformatics community is able and encouraged to contribute to the pipelines to further develop the pipelines to fit the needs of other labs in New Zealand
- GPU enabled runs using the NVIDIA Clara Parabricks is designed to match a CPU run by using the same software and parameters
How the pipelines are best practice
- Designed to follow the current GATK best practice workflows for germline short variant discovery (https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-)