Physics vs Bioinformatics – rumble in the HPC jungle.

I often hear anecdotes comparing physicists and bioinformaticians and their respective use of high-performance computing (HPC) systems. A large number of HPC systems across the world are consumed with running jobs for physicists: modelling atom collisions, for example, or running large finite-difference codes for wave or fluid propagation.

By contrast, the bioinformaticians are often derided as poor users of HPC resources and receive little support.  

Why is this?  What’s going on?  Surely their research is as deserving of enormous number crunching as anything in the physics realm?

The answer is mostly historical.  

Ask yourself: who invented the modern computer? Physicists. Three physicists, John Bardeen, William Shockley and Walter Brattain, were responsible for inventing the transistor that made modern computing possible. Physicists then continued to revolutionise the computing industry, including Intel’s founders Gordon Moore (chemist/physicist) and Robert Noyce (physicist). Is it any wonder that computers have evolved to solve physicists’ problems?

Physicists tend to have very structured data: large 3D models, broken up into regular grids, that can be easily vectorised, parallelised and run efficiently on computer hardware.  Often the input is a small handful of numbers describing the system and the results range from a small group of numbers to large regular grids, holding the output of the simulation. Physicists and computers have evolved together and, some may say, they’ve developed an unhealthy codependency, almost to the exclusion of all others.
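To make the "regular grid" point concrete, here is a minimal sketch of an explicit finite-difference update for 1D diffusion. The function name `diffuse`, the grid size and the coefficient `alpha` are illustrative assumptions, not taken from any production code. The thing to notice is that every interior cell gets exactly the same arithmetic applied to neighbours at fixed offsets, which is the kind of predictable access pattern that hardware vectorises and parallelises well.

```python
def diffuse(grid, alpha, steps):
    """Explicit finite-difference diffusion on a regular 1D grid.

    Assumed for illustration: the two end cells are held fixed and
    alpha (diffusivity * dt / dx**2) is small enough to be stable.
    """
    u = list(grid)
    for _ in range(steps):
        # Same formula for every interior cell, reading neighbours
        # at fixed offsets (-1, 0, +1). No searching is required.
        interior = [
            u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
            for i in range(1, len(u) - 1)
        ]
        u = [u[0]] + interior + [u[-1]]
    return u


# A handful of input numbers describes the whole system.
grid = [0.0] * 11
grid[5] = 1.0  # a single spike in the middle
result = diffuse(grid, alpha=0.25, steps=3)
```

The input is a few numbers and the output is another regular grid of the same shape, read and written in order from start to finish.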

Then, in the 1990s, along came the plucky upstart: bioinformatics. Bright-eyed and bushy-tailed. A burgeoning field driven by advances in genome sequencing, with huge potential to advance humanity’s health. BUT their data, and subsequent algorithms, are very unstructured. A sequencer produces many small snippets of data (reads), each of which must be searched individually to find overlapping and matching components.
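As a toy illustration of that overlap search, the sketch below indexes every read by its first `k` bases and then looks up each read's last `k` bases in that index. The names `build_prefix_index` and `find_overlaps`, the value of `k` and the sample reads are all made up for this example; real assemblers and aligners rely on far more sophisticated structures. Even so, it shows how each lookup jumps to an unpredictable place in memory rather than walking a grid in order.

```python
from collections import defaultdict


def build_prefix_index(reads, k):
    """Map each length-k prefix to the ids of the reads that start with it."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        if len(read) >= k:
            index[read[:k]].append(read_id)
    return index


def find_overlaps(reads, k):
    """Return (i, j) pairs where the last k bases of read i
    equal the first k bases of read j."""
    index = build_prefix_index(reads, k)
    overlaps = []
    for i, read in enumerate(reads):
        if len(read) < k:
            continue
        # Hash lookup: where the match lives depends entirely on the data.
        for j in index.get(read[-k:], []):
            if i != j:
                overlaps.append((i, j))
    return overlaps


# Hypothetical reads, chosen so that read 0 overlaps read 1,
# and read 1 overlaps read 2.
reads = ["ACGTAC", "TACGGA", "GGATTC", "CCCCCC"]
overlaps = find_overlaps(reads, k=3)
```

Scale this up to billions of reads and the index no longer fits in cache, or even in memory, so each lookup becomes a small random access to storage.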

With bioinformatics there is no single correct algorithm and the answer is unknown. From one workflow to the next, the way data is used continually changes. Data isn’t read and processed in a serial fashion as it is in a physics-based algorithm. Instead, a large amount of compute effort is expended just to locate the tiny piece of data to which the next part of the genome sequence can be appended.

Unsurprisingly, the unstructured nature of the data, and the rapidly evolving computational approaches used in bioinformatics research, mean that computer hardware and systems haven’t really had time to adjust.

I’m never one to shy away from controversy so I’ll come right out and say it (as a physicist)… bioinformaticians are far more sophisticated users of data than physicists.  The way they think about data is completely different and does not fit conventional computational systems.

BUT, the HPC industry is evolving to meet their demands. New products like VAST Data Universal Storage can drastically accelerate bioinformatics workflows. The ability to access data randomly with no penalty for “seeks” allows practitioners to use their data as they wish. They can implement their unstructured workflows and are no longer constrained to conform to the physicist’s linear view of the world.

Is it any wonder there’s some contention?

I am excited by the direction the HPC industry is heading.  New ideas benefit everyone.

(Just don’t tell my fellow physicists I said so!)

 

By Stuart Midgley

Stuart Midgley is DUG's CIO and self-confessed "mad scientist". He holds a PhD in theoretical physics and is a world expert in high-performance computing. Stuart designed and developed the DUG Cool immersion cooling system and was instrumental in the construction of DUG's world-class data centres, among the greenest on earth. He's just as handy behind a BBQ. After all, he owns 17 of them.

DUG