1.5MW and 3600 penguins in a room: Supercomputing at ANU
Project: mainly oneSIS, bobMonitor, and the kernel; touching on OpenMPI, zfsonlinux, PBS
At the NCI National Facility at ANU we exclusively use Linux for supercomputing. This talk covers how we use and alter (and hopefully help to improve) Linux and other Open Source projects to create a fast and stable petascale machine.
- How we boot and run clusters at large scale using oneSIS, CentOS, root on Lustre, ...
- How we tweak the kernel and tune the OS to keep them stable and efficient
- Problems we've seen and areas we'd like to see improved
Our current supercomputer is 12k Nehalem cores with a 20GB/s filesystem and QDR InfiniBand. The next machine (hopefully operational by LCA time) is 60k Sandy Bridge cores with a 100GB/s filesystem and FDR InfiniBand. Each new cluster typically debuts in the top 50 machines in the world.
oneSIS and the Lustre filesystem, together with CentOS running vanilla and modified RHEL6 kernels, are the primary components of our software stack. Uniquely, the OSes for all 3600 nodes boot from a single copy of the OS that resides on the global Lustre filesystem. This gives us an easily updatable OS and guarantees that all compute nodes boot identical software.
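For a feel of how a shared read-only root stays usable, here is a minimal sketch of the overlay idea. This is not oneSIS's actual mechanism or configuration format - oneSIS drives the equivalent setup from its own config at boot time - and the list of writable directories is an assumption:

```python
#!/usr/bin/env python3
# Sketch of the shared read-only root idea: every node mounts the same
# OS image read-only, then lays tmpfs over the few directories that need
# local writable state.  This is NOT oneSIS itself, and the list of
# writable directories below is illustrative only.
import subprocess

WRITABLE_DIRS = ["/var/run", "/var/log", "/var/lock", "/tmp"]  # assumed set

# The initramfs has already mounted / read-only (off Lustre in our case);
# give the OS somewhere to write without ever touching the shared image.
for d in WRITABLE_DIRS:
    subprocess.run(
        ["mount", "-t", "tmpfs", "-o", "size=64m,mode=0755", "tmpfs", d],
        check=True,
    )
```

Keeping per-node state down to a handful of tmpfs mounts is what lets a single image serve 3600 nodes without ever diverging.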
High performance and predictability are a key focus. The main player in this game is NUMA awareness. We use cpusets extensively and enforce processor and memory binding to obtain ideal NUMA alignment and optimal, reproducible job performance. We have worked with the OpenMPI project to get binding included, and our users run jobs with binding on by default.
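As a concrete (and hedged) illustration of the cpuset side, here's a minimal sketch using the cgroup-v1 cpuset filesystem. The mount point, job directory, CPU range and binary are all assumptions - in production a resource manager creates these per job, and OpenMPI additionally binds each rank itself:

```python
#!/usr/bin/env python3
# Sketch: pin a job to one NUMA node with a cgroup-v1 cpuset.
# Assumptions: the cpuset controller is mounted at /sys/fs/cgroup/cpuset,
# we run as root, and the job directory, CPU range, memory node and
# binary are illustrative placeholders.
import os
from pathlib import Path

cpuset = Path("/sys/fs/cgroup/cpuset/job.1234")   # hypothetical per-job group
cpuset.mkdir(exist_ok=True)
(cpuset / "cpuset.cpus").write_text("0-7")        # the cores of socket 0
(cpuset / "cpuset.mems").write_text("0")          # allocate only from node 0
(cpuset / "tasks").write_text(str(os.getpid()))   # move this process in

# Children inherit the cpuset, so both CPU placement and memory allocation
# for the whole job tree are now confined to a single NUMA node.
os.execvp("./app", ["./app"])                     # placeholder application
```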
We are increasingly using cgroups and may eventually use all of cputime, freeze/thaw, acl, namespaces, and memcg - tasks which our own resource manager software currently handles. We also have a history of and interest in page migration, which may again become useful for reducing machine fragmentation as each NUMA zone gains more cores.
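To give a flavour of what that migration to cgroups could look like, here's a small sketch using the v1 freezer and memory controllers. Paths and the job name are assumed, and this is not how our resource manager currently does it:

```python
#!/usr/bin/env python3
# Sketch: freeze/thaw and memory-cap a job with cgroup-v1 controllers.
# Mount points and the job name are assumptions; a batch system would
# create these groups at job start and put the job's PIDs into each
# group's 'tasks' file before any of this takes effect.
from pathlib import Path

freezer = Path("/sys/fs/cgroup/freezer/job.1234")
memcg = Path("/sys/fs/cgroup/memory/job.1234")
for group in (freezer, memcg):
    group.mkdir(exist_ok=True)

# Cap the job at 32 GiB of RAM via memcg.
(memcg / "memory.limit_in_bytes").write_text(str(32 * 2**30))

# Suspend every task in the job, e.g. to let something urgent run...
(freezer / "freezer.state").write_text("FROZEN")
# ...and later resume it.
(freezer / "freezer.state").write_text("THAWED")
```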
Like many HPC sites, we write our own job and cluster monitoring tools, and these are available for other sites to use and modify, as are our kernel changes and some InfiniBand tools. Unlike many sites, we also write our own batch system.
I'll talk about our Linux improvement wishlist too, e.g.:
- Lustre integration into mainline, so we can update kernels more aggressively and get shiny new stability
- inode/dentry reclaim of deleted inodes with vfs_cache_pressure=0 (see the sketch after this list)
- low-overhead control groups
- better transparent huge pages
- improved virtual memory stability
- md is awesome, but zfs-on-linux might be better
- OFED integration with mainline
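To make the vfs_cache_pressure item concrete: vm.vfs_cache_pressure is a standard sysctl, and setting it to 0 stops dentry/inode reclaim altogether - the wish is for the kernel to still reclaim inodes whose files have been deleted even at that setting. A trivial sketch:

```python
#!/usr/bin/env python3
# Sketch: pin the dentry/inode caches.  vm.vfs_cache_pressure is a real
# sysctl; 0 tells the kernel never to reclaim these caches under memory
# pressure, which is great for metadata-heavy workloads but today also
# keeps inodes of deleted files pinned - the behaviour the wishlist item
# above asks to be improved.
from pathlib import Path

Path("/proc/sys/vm/vfs_cache_pressure").write_text("0\n")
```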
Robin is a reformed computational astrophysicist. He's been building clusters since 1998 and top 50 machines since 2003. He's currently the Systems Analyst for the NCI National Facility at ANU, where he finds ways to make Australia's largest HPC machines run better and faster. He mostly works on operating systems and filesystems. He likes cats and hates all disks.