The field of bioinformatics exploded onto the scene following the wealth of biological data generated by the Human Genome Project.
It’s a highly interdisciplinary field requiring expertise in biology, computer science, medicine and mathematics, to name a few. The role of bioinformaticians in medical research is to apply information technology to analyse, interpret and extract life-saving knowledge from a sea of biological data.
Expertise in handling bioinformatics data.
As bioinformatics evolves rapidly, researchers must continuously test and explore new algorithms and methodologies, which results in high data throughput. Unlike physicists, chemists and astrophysicists, bioinformaticians are non-traditional high-performance computing (HPC) users with unconventional data utilisation techniques: their mathematical methods are unstructured, and colossal amounts of genomic data are stored and analysed via complex access patterns.
To meet the demanding compute needs of bioinformaticians, HPC systems have been evolving rapidly, with new applications and software packages constantly being developed and optimised to support their more sophisticated data usage.
At DUG, our HPC experts have been working extensively with the bioinformatics community in Australia to build, install and fine-tune several software packages and workflow systems in the field. With these packages seamlessly integrated into our DUG McCloud HPC environment, coupled with on-demand support from our highly skilled HPC experts, researchers can focus on their science, without worrying about the HPC challenges that used to plague them.
Here are some of the open-source software packages and tools our HPC specialists are using at DUG for projects related to bioinformatics:
- Sambamba – A highly parallel, robust and fast tool (and library) written in the D programming language to work with SAM and BAM files. Due to its high efficiency, Sambamba is one of the most important workhorses running in many genome sequencing centres around the world.
- Samtools – A suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories: samtools (handles SAM, BAM and CRAM files), htslib (a C library for handling high-throughput sequencing data) and bcftools (handles VCF and BCF files and calls variants such as SNPs).
- Bowtie 2 – An ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences, which supports gapped, local and paired-end alignment modes. It is particularly good at aligning reads to relatively long genomes such as those of mammals.
- Burrows-Wheeler Aligner (BWA) – A software package for mapping DNA sequences against a large reference genome, such as the human genome.
- Genome Analysis Toolkit (GATK) – A genomic analysis toolkit developed by the Broad Institute with a primary focus on variant discovery and genotyping.
- Trinity – Assembles transcript sequences from Illumina RNA-Seq data. It performs the whole assembly itself by sequentially applying different software modules to process large volumes of RNA-Seq reads, which can consume a large amount of memory. In one project, HPC experts at DUG processed 86 GB of input data and produced 1.2 TB of output using 435 GB of memory.
Workloads in bioinformatics, often big search problems, are generally carried out using not just one particular software package, but a combination of many individual, independent toolkits, such as those listed above or those found in Bioconda, BioContainers and GitHub repositories. These tools are strung together so that each performs a specific task at one step, and together they constitute an entire bioinformatics sequencing project. HPC specialists at DUG are experts at analysing and optimising the bottlenecks in these complex workflows, and at implementing workflow management systems such as Cromwell, Nextflow and Snakemake to utilise DUG’s compute resources efficiently.
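To make the idea of tools being strung together more concrete, here is a minimal Python sketch of how a workflow can be described as steps with dependencies and then grouped into waves that could run in parallel. It is an illustration only: the step names and the `execution_waves` helper are hypothetical, it does not call any of the tools listed above, and it is not how Cromwell, Nextflow or Snakemake are implemented or configured.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
# The names are illustrative labels, not real commands or tool invocations.
PIPELINE = {
    "align_reads": set(),
    "sort_alignments": {"align_reads"},
    "mark_duplicates": {"sort_alignments"},
    "alignment_stats": {"sort_alignments"},
    "index_alignments": {"mark_duplicates"},
    "call_variants": {"index_alignments"},
}


def execution_waves(pipeline):
    """Group steps into waves.

    Every step in a wave has all of its dependencies satisfied by earlier
    waves, so the steps within one wave could be scheduled concurrently.
    """
    sorter = TopologicalSorter(pipeline)
    sorter.prepare()  # raises graphlib.CycleError if the steps form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(PIPELINE), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

In this sketch, the duplicate-marking and statistics steps land in the same wave because both depend only on the sorted alignments. That kind of independence is what a scheduler can exploit on a cluster.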
Accelerating medical research at Harry Perkins Institute of Medical Research.
Our unlimited HPC scale and tailored code-optimisation support gave bioinformaticians at the Harry Perkins Institute of Medical Research quick and easy access to their huge datasets. This allowed them to implement and run all their existing workflows efficiently, and gave them the confidence to conduct science at speeds they could previously only have dreamed of.
DUG’s HPC expertise, powered by VAST Universal Storage and Intel hardware, delivered a step-change in Harry Perkins’ HPC capabilities, helping its researchers tackle some of the world’s biggest health issues, including cancer and a number of rare genetic diseases.