Preferred location: London, UK
Secondary location: Houston, TX or Perth, AU
Remote work will be considered for truly exceptional candidates with a history of successful remote work.
This role combines software engineering and very strong Linux expertise to build scalable infrastructure for, and diagnose challenging problems in, large HPC systems.
- [> 50%] Implementing and maintaining software to improve the availability, scalability, maintainability, security, and performance of our HPC systems. For example:
- Automated monitoring systems that reduce human toil
- Automated self-repair systems for simple / recurring issues
- Properly targeted alerting for truly urgent issues
- Infrastructure such as file systems, job queueing systems, archival and multi-site synchronisation tools, etc.
- Tools to help end-user geoscientists use and manage HPC resources
- [< 50%] Helping troubleshoot the complete stack of hardware and software when issues are escalated by IT technicians, and generally pursuing and fixing the root causes of hard system issues.
- Being the company expert for one or more major IT subsystems (e.g. OS, Lustre, Slurm, networking, etc.)
- Helping to develop procedures and tools for IT technicians
- Participating in the on-call roster for urgent weekend issues, which provides a strong feedback loop for improving automation and properly targeted alerting
- Software development expertise in at least C/C++ and one or more scripting languages
- Ability to diagnose complex Linux problems and identify opportunities for improvement
- Ability to operate independently, with ownership and accountability
- Familiarity with software collaboration tools such as revision control and issue tracking
- Organisation and attention to detail
- Excellent spoken and written English.
The ideal candidate would also have substantial experience with one or more of:
- Linux kernel programming
- Lustre file system internals
- System automation frameworks
- IT security.