Advancing bioinformatics with high-performance computing.

The field of bioinformatics exploded onto the scene following the wealth of biological data generated by the Human Genome Project.

It’s a highly interdisciplinary field requiring expertise in biology, computer science, medicine and mathematics, to name a few. The role of bioinformaticians in medical research is to apply information technology to analyse, interpret and extract life-saving knowledge from a sea of biological data.

Expertise in handling bioinformatics data.

With the rapid evolution of bioinformatics, researchers are required to continuously test and explore new algorithms and methodologies, resulting in high data throughput.

Unlike physicists, chemists and astrophysicists, bioinformaticians are non-traditional high-performance computing (HPC) users with unconventional data utilisation techniques: their mathematical methods are unstructured, and colossal amounts of genomic data are stored and analysed via complex access patterns.

To meet the demanding compute requirements of bioinformaticians, HPC systems have been evolving rapidly, with new applications and software packages constantly being developed and optimised to permit more sophisticated data usage.

At DUG, our HPC as a service (HPCaaS) experts have been working extensively with the bioinformatics community in Australia to build, install and fine-tune several software packages and workflow systems in the field. With these packages seamlessly integrated into our cloud-based HPC environment, coupled with on-demand support from our highly skilled HPCaaS experts, researchers can focus on their science, without worrying about the HPC challenges that used to occupy a large portion of their time.

Here are some of the open-source software packages and tools our HPCaaS specialists are using at DUG for projects related to bioinformatics:

  • AlphaFold: An artificial intelligence (AI) program developed by DeepMind (part of Alphabet, Google’s parent company), which predicts protein structures.
  • Sambamba: A highly parallel, robust and fast tool (and library) written in the D programming language to work with SAM and BAM files. Due to its high efficiency, Sambamba is one of the most important workhorses running in many genome sequencing centres around the world.
  • Samtools: A suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories: samtools (handles SAM, BAM and CRAM files), htslib (a C library for handling high-throughput sequencing data) and bcftools (handles VCF and BCF files, as well as calling SNPs and short indels).
  • Bowtie 2: An ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences, which supports gapped, local and paired-end alignment modes. It is particularly good at aligning reads to relatively long genomes such as those of mammals.
  • Burrows-Wheeler Aligner (BWA): A software package for mapping DNA sequences against a large reference genome, such as the human genome.
  • Genome Analysis Toolkit (GATK): A genomic analysis toolkit developed by the Broad Institute with a primary focus on variant discovery and genotyping.
  • Trinity: Assembles transcript sequences from Illumina RNA-Seq data. It performs the entire assembly by sequentially applying different software modules to large volumes of RNA-Seq reads, which can consume a large amount of memory. In one project, HPC experts at DUG processed 86 GB of input data and produced 1.2 TB of output using 435 GB of memory.

Workloads in bioinformatics, often big search problems, are generally carried out using not just one software package but a combination of many independent toolkits, such as those listed above or those found in Bioconda, BioContainers and GitHub repositories.

These software packages and tools are strung together, each performing a specific task at its own step, and together they constitute an entire bioinformatics sequencing project.
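To make the idea of chaining tools concrete, here is a minimal sketch in Python that builds the shell commands for a simple align, sort, index and variant-calling chain using BWA, samtools and bcftools. It is an illustration only, not DUG's actual workflow: the function name build_pipeline, the file names and the thread count are hypothetical, and the sketch assumes the reference has already been indexed with "bwa index".

```python
import shlex


def build_pipeline(ref, reads1, reads2, sample, threads=8):
    """Return shell commands for an align -> sort -> index -> call chain.

    All names here are illustrative; the reference is assumed to have been
    indexed beforehand with `bwa index`.
    """
    q = shlex.quote  # protect paths containing spaces or shell metacharacters
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf.gz"
    return [
        # Step 1: align paired-end reads with BWA-MEM, then sort by coordinate.
        f"bwa mem -t {threads} {q(ref)} {q(reads1)} {q(reads2)} | "
        f"samtools sort -@ {threads} -o {q(bam)} -",
        # Step 2: index the sorted BAM so later tools can access regions quickly.
        f"samtools index {q(bam)}",
        # Step 3: pile up reads against the reference and call variants.
        f"bcftools mpileup -f {q(ref)} {q(bam)} | "
        f"bcftools call -mv -Oz -o {q(vcf)}",
    ]


if __name__ == "__main__":
    # Dry run: print the commands rather than executing them.
    for cmd in build_pipeline("ref.fa", "sample_R1.fastq.gz",
                              "sample_R2.fastq.gz", "sample"):
        print(cmd)
```

Each command consumes the output of the previous one, which is why a slow or memory-hungry step can become the bottleneck for the whole project.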

HPCaaS specialists at DUG are experts at analysing and optimising the bottlenecks in these complex workflows, and implementing different workflow management systems such as Cromwell, Nextflow and Snakemake to utilise DUG’s compute resources efficiently.
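As a rough illustration of what a workflow management system does at its core, the sketch below orders a set of hypothetical pipeline steps by their dependencies and groups them into batches that could run in parallel. It uses Python's standard graphlib module and is not how Cromwell, Nextflow or Snakemake are implemented; the step names and the execution_batches helper are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
STEPS = {
    "align": set(),
    "sort": {"align"},
    "index": {"sort"},
    "call_variants": {"sort", "index"},
    "qc_report": {"align"},
}


def execution_batches(graph):
    """Group steps into batches whose members do not depend on each other.

    Steps within one batch could be submitted to the cluster at the same time.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are cyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose inputs are all complete
        batches.append(ready)
        sorter.done(*ready)  # mark them finished so dependants become ready
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(execution_batches(STEPS), start=1):
        print(f"batch {number}: {', '.join(batch)}")
```

In this toy example the sort and qc_report steps land in the same batch because both only need the alignment output. Real workflow managers add much more on top of this ordering, such as resource requests, retries and caching of completed steps.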

Accelerating medical research to advance human health.

Our unlimited HPC scale and tailored code-optimisation support enabled bioinformaticians at the Harry Perkins Institute of Medical Research, Telethon Kids Institute and the Commonwealth Scientific and Industrial Research Organisation (CSIRO) to gain quick and easy access to their huge datasets. This enabled them to efficiently implement and run their existing workflows, giving them the confidence to conduct science at record speeds that were previously hard to achieve in a traditional set-up.

DUG’s HPCaaS expertise, powered by VAST Universal Storage and Intel hardware, delivered a step-change in Harry Perkins’ HPC capabilities, supporting medical research that tackles some of the world’s biggest health issues, including cancer and a number of rare genetic diseases.

Check out our Harry Perkins case study here.

We’re excited to play a part in accelerating the translation and commercialisation of your bioinformatics and life sciences research. Please reach out to us at [email protected] to learn more about our HPCaaS capabilities, which have a focus on:

  • Data sovereignty.
  • Data and cyber security.
  • Green computing for a positive environmental, social and governance (ESG) footprint.
  • Dedicated HPC support.
  • Domain-specific expertise.

By Mitchell Lim

Mitchell Lim is DUG's Scientific Content Architect. With a PhD in Chemical Engineering, Mitch is an expert in the fields of catalysis and ultrasonics. Full-time science geek, part-time fitness junkie, Mitch strives to deliver effective and engaging science communication, as he believes that easily digestible scientific perspectives have the potential to impact and benefit society at large.

DUG Technology