HPC networking: a ballad in four parts. (Part 3: Resilience)

High Performance Computing (HPC) networking is very different to the traditional networks that ISPs, businesses and enterprises build. The ideas and concepts are orthogonal.

The whole system has to be thought of as a single ‘computer’ with an internal fabric joining it all up. These fabrics are mostly ‘set and forget’: they require a lot of setup and configuration up front, with few changes over the life of the system.

The critical concept with HPC fabrics is to have massive amounts of scalable bandwidth (throughput) whilst keeping the latency ultra-low. In this four-part series we discuss what makes HPC Networking truly unique. This is Part 3: Resilience. You can go back and read Part 1: Latency here and Part 2: Bandwidth here.

Resilience.

As can be seen in the fat-tree diagram, there are multiple equal-cost paths from A to B. With a well-implemented fabric, you can lose any spine and only suffer a drop in aggregate bandwidth (e.g. a loss of 1/16 of the bandwidth in the example). If you lose a leaf, however, you lose connectivity to the compute servers attached to it.
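The spine-loss arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes 16 spines (implied by the 1/16 figure in the example) and traffic spread evenly across all equal-cost paths, and the function name `remaining_bandwidth` is made up for this sketch.

```python
from fractions import Fraction


def remaining_bandwidth(num_spines: int, failed_spines: int) -> Fraction:
    """Fraction of aggregate bandwidth left after some spines fail.

    Assumes every spine carries an equal share of the traffic, which is
    an idealisation of a well-balanced fat-tree.
    """
    if not 0 <= failed_spines <= num_spines:
        raise ValueError("failed_spines must be between 0 and num_spines")
    return Fraction(num_spines - failed_spines, num_spines)


# Assumed example: 16 spines, one of them fails.
left = remaining_bandwidth(16, 1)
print(f"remaining: {left}, lost: {1 - left}")
```

Under these assumptions each further spine failure costs another 1/16 of the aggregate bandwidth, and connectivity between leaves is only lost once every spine is down.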

Leaf failure can be handled by replicating the fat-tree and going ‘dual-rail’, which means that each compute server is attached to two independent fat-tree fabrics. While this appears to double the cost, it doesn’t necessarily.

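In the dual-rail example, each server has one 50Gb/s connection to each fabric, giving an aggregate of 100Gb/s per server. Running the server links at half the speed doubles the number of servers that can be attached to each fat-tree, which is why the cost does not necessarily double.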

This dual-plane fat-tree keeps every server connected through the failure of any single switch. At worst, the servers attached to a failed leaf lose 50% of their bandwidth, while the rest of the fabric still delivers full bandwidth.
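As a rough sketch of the dual-rail case, the bandwidth a server sees is its per-rail link speed multiplied by the number of rails whose leaf switch is still up. The 50Gb/s figure comes from the example above; the function name `server_bandwidth_gbps` and the list-of-booleans representation are assumptions made for illustration.

```python
def server_bandwidth_gbps(link_gbps: float, leaf_up_by_rail: list[bool]) -> float:
    """Bandwidth available to one server, given the state of its leaf on each rail.

    leaf_up_by_rail holds one entry per rail: True if the leaf switch the
    server attaches to on that rail is healthy, False if it has failed.
    """
    return link_gbps * sum(leaf_up_by_rail)


LINK_GBPS = 50  # per-rail server link speed from the example

print(server_bandwidth_gbps(LINK_GBPS, [True, True]))   # both rails healthy
print(server_bandwidth_gbps(LINK_GBPS, [True, False]))  # one leaf failed
print(server_bandwidth_gbps(LINK_GBPS, [False]))        # single rail, leaf failed
```

With two rails, a single leaf failure halves the affected servers' bandwidth but leaves them reachable. With one rail, the same failure cuts them off entirely.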

Keep an eye out soon for the final installment in our HPC networking series: Protocols.

If you’re enjoying our HPC Networking series, follow us on Twitter for regular updates.

By Stuart Midgley
