A state-of-the-art, high-performance computing (HPC) system was migrated and redeployed to replace a legacy Department of Defense (DoD) HPC system, with the goal of providing DoD scientists a faster, more stable platform for their computations and simulations.
The objective was to redesign, engineer, manage, and operate a newer, faster, cost-effective replacement for the legacy high-performance computing system.
Upon arrival at the DoD campus, the SMS team brought the HPC to a controlled work environment for initial testing. The HPC was reconfigured to run a Defense Research and Engineering Network (DREN) Unix operating system image and existing DREN networking components. Once the reconfiguration was complete, the SMS team performed a proof of concept using a full “M-Cell,” which consists of four compute node racks (each compute node providing 2400 CPU cores and 2.348 TB of memory), two center cooling racks, and, because the entire system is water-cooled, one water pump rack.
During initial testing, the SMS team noticed discrepancies in performance, which troubleshooting traced to temperature fluctuations in the compute nodes. To troubleshoot these fluctuations properly, the team incorporated open source software into the Unix operating system image to report the temperature of each compute node. This enabled temperature mapping for each rack as a whole, which revealed patterns in the fluctuations pointing to debris buildup in the water cooling system as the root cause. The cooling system was flushed, drastically improving the performance of each compute node. Safety controls were then built into each node to perform an automated shutdown whenever its temperature exceeded a set threshold, preventing damage from thermal events. This precaution eliminated the majority of the HPC's power and cooling issues and yielded metrics on how the machine functioned from a computing standpoint.
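The monitoring and shutdown logic described above can be sketched in a few lines of Python. The source does not name the open source tool the team used, so the node names, the 85 °C threshold, and the rack-naming scheme here are all hypothetical; a real deployment would read temperatures from a tool such as lm-sensors or IPMI rather than from an in-memory dictionary.

```python
# Minimal sketch of per-node temperature mapping with an automated shutdown
# threshold. All names and the threshold value are hypothetical examples.

SHUTDOWN_THRESHOLD_C = 85.0  # assumed threshold, not from the source

def nodes_over_threshold(node_temps, limit=SHUTDOWN_THRESHOLD_C):
    """Return the names of nodes whose temperature meets or exceeds limit
    (candidates for an automated safety shutdown)."""
    return sorted(name for name, temp in node_temps.items() if temp >= limit)

def rack_temperature_map(node_temps):
    """Group node temperatures by rack prefix (e.g. 'r1' from 'r1n03') and
    report the hottest node per rack, so rack-level patterns stand out."""
    racks = {}
    for name, temp in node_temps.items():
        rack = name.split("n")[0]  # assumes a '<rack>n<node>' naming scheme
        racks.setdefault(rack, []).append(temp)
    return {rack: max(temps) for rack, temps in racks.items()}

if __name__ == "__main__":
    sample = {"r1n01": 62.0, "r1n02": 91.5, "r2n01": 70.2, "r2n02": 68.9}
    print(nodes_over_threshold(sample))   # nodes due for automated shutdown
    print(rack_temperature_map(sample))   # hottest node in each rack
```

Mapping the hottest node per rack, rather than listing every reading, is what lets a pattern such as the debris buildup above emerge at a glance.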
Following proof of concept testing, the SMS team completed the full HPC installation, reusing fiber-optic cabling recovered during the HPC disassembly, and deployed monitoring systems to prevent future thermal events. Because the team had already worked out the Unix software installation and configuration for the M-Cell during the proof of concept, they brought the full system online in significantly less time.
For ongoing monitoring of each compute node, the SMS team also built a custom web GUI that displays a physical representation of each compute rack in the M-Cell, showing the temperature of every compute node in that rack. Following the deployment, the team finalized the build of the queueing system through which customers submit compute jobs. This queueing system is based on the same software that large HPC centers use, so scientists can test and model their code first and then easily move it to the larger centers if they need more horsepower.
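The submit-then-run model behind such a queueing system can be illustrated with a toy first-in, first-out queue. The source does not identify the scheduler, though large HPC centers typically run batch schedulers such as Slurm or PBS; the class and method names below are hypothetical and do not reflect any real scheduler's interface.

```python
from collections import deque

# Toy FIFO batch queue illustrating the submit-then-run model that HPC batch
# schedulers provide. All names here are illustrative assumptions.

class BatchQueue:
    def __init__(self):
        self._jobs = deque()
        self._next_id = 1

    def submit(self, script):
        """Queue a job script and return its job ID (in the spirit of
        a scheduler's submit command, e.g. sbatch or qsub)."""
        job_id = self._next_id
        self._next_id += 1
        self._jobs.append((job_id, script))
        return job_id

    def run_next(self):
        """Dispatch the oldest queued job; returns (job_id, script),
        or None when the queue is empty."""
        return self._jobs.popleft() if self._jobs else None

if __name__ == "__main__":
    q = BatchQueue()
    q.submit("model_test.sh")   # small test run, job 1
    q.submit("full_run.sh")     # larger batch, job 2
    print(q.run_next())         # (1, 'model_test.sh') -- first in, first out
```

Because the interface mirrors the submit/dispatch cycle of production schedulers, a job script proven on the smaller system should carry over to a larger center with little change.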
The newly repurposed HPC will allow scientists across multiple directorates to build scientific computer modeling simulations and to run smaller batch code tests, giving them quick answers to questions that arise within their projects. Through the successful redesign of the HPC to replace a legacy, aging platform, SMS helped the DoD lab save millions of dollars over the cost of a new HPC.