In August 2019, the Department of Energy (DOE), the National Nuclear Security Administration (NNSA), and LLNL announced a partnership with Hewlett Packard Enterprise (HPE) and Advanced Micro Devices (AMD) to build NNSA’s first exascale supercomputer, El Capitan. Procurement of this system ensures the NNSA laboratories—LLNL, Sandia (SNL), and Los Alamos (LANL)—can meet their primary mission of keeping the nation’s nuclear stockpile safe, secure, and effective.
When El Capitan comes online in 2023, it will have a peak performance of more than 2 exaflops (2 quintillion calculations per second)—roughly 16 times faster on average than the Sierra system—and is projected to be several times more energy efficient than Sierra.
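For scale: Sierra’s published peak is 125 petaflops, and 2 exaflops is 2,000 petaflops, so the peak-to-peak ratio alone works out to 16. A minimal sanity check of that arithmetic (the 125-petaflop figure is Sierra’s public spec; the rest comes from this article):

```python
# Back-of-the-envelope check of the "roughly 16 times faster" claim.
sierra_peak_pflops = 125        # Sierra's published peak, in petaflops
el_capitan_peak_pflops = 2_000  # 2 exaflops = 2,000 petaflops

print(el_capitan_peak_pflops / sierra_peak_pflops)  # -> 16.0
```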
Performance: Speed Plus Efficiency
It’s not just speed that will make El Capitan noteworthy. LLNL’s future flagship advanced technology system (ATS) will have “a number of things that move the architecture for large systems forward,” according to Bronis R. de Supinski, Livermore Computing (LC) Chief Technology Officer. In addition to the HPE Cray EX architecture, which includes Slingshot networking and advanced AMD central and graphics processing units (CPUs and GPUs) on each node, the platform will come equipped with first-of-their-kind HPE Cray “Rabbit” modules, a flexible, disaggregated near-node local storage innovation.
“We expect that using Rabbit modules will significantly reduce system interference from input–output (I/O) on the overall system,” says de Supinski.
He explains the rationale for taking such risks on novel architectures. “At LLNL, we try to be innovative, but in a way that gives us confidence in the innovations working,” he says. “The Rabbits are a good example of this. It’s a multilayered strategy. If the Rabbits fail, we’d have to use the traditional file systems as backup. But, if the Rabbits work, we’re getting an order of magnitude faster I/O.”
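De Supinski’s description implies a two-tier checkpointing pattern: absorb writes on the fast near-node tier, drain them to the parallel file system in the background, and fall back to writing directly to the traditional file system if the fast tier is unavailable. The sketch below illustrates that pattern in Python under stated assumptions; the mount points and the availability check are hypothetical stand-ins, not El Capitan’s actual Rabbit interface.

```python
import os
import shutil
import threading
from pathlib import Path

# Hypothetical mount points, for illustration only -- not El Capitan's
# actual Rabbit or parallel file system paths.
NEAR_NODE = Path("/mnt/near-node")     # fast, disaggregated near-node tier
PARALLEL_FS = Path("/p/lustre/ckpts")  # traditional parallel file system

def near_node_available() -> bool:
    """Assume the fast tier is usable if its mount point is writable."""
    return NEAR_NODE.is_dir() and os.access(NEAR_NODE, os.W_OK)

def drain(src: Path, dst: Path) -> None:
    """Copy a checkpoint from the fast tier to the parallel file system."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)

def write_checkpoint(name: str, data: bytes) -> None:
    if near_node_available():
        # Fast path: the near-node tier absorbs the write, and a background
        # thread drains it, so compute does not stall on the shared system.
        local = NEAR_NODE / name
        local.write_bytes(data)
        threading.Thread(target=drain, args=(local, PARALLEL_FS / name)).start()
    else:
        # Fallback path: write straight to the traditional file system,
        # mirroring the backup strategy de Supinski describes.
        target = PARALLEL_FS / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)

write_checkpoint("step_000100.ckpt", b"...application state...")
```

Because each node’s writes land locally instead of contending for the shared file system, this pattern is also where the expected reduction in system-wide I/O interference comes from.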
According to de Supinski, such innovations are not for show. “It all comes down to cost-effective use of budget and the resources deployed. The most cost-effective system is the one you’re using 100% of the time: the network, the file system, the compute nodes. We don’t want to buy resources that will not be well-utilized most of the time. It’s a waste.”
Partnerships with Industry
LLNL’s most recent ATS machines were supplied by IBM and NVIDIA, so building El Capitan means forging a new working relationship with a large vendor, HPE. “In general,” says de Supinski, “the partnership between LLNL and HPE and between HPE and AMD is going very well. We’ve already sited and accepted two early access systems, which establishes our way of working together.”
Teaming Up across the DOE Complex
El Capitan’s acquisition was part of the CORAL-2 procurement, the second collaboration among DOE’s Oak Ridge, Argonne, and Lawrence Livermore national labs. Oak Ridge’s Frontier and Livermore’s El Capitan are both HPE systems, so the two labs can share the effort of preparing for them, with contributions from Sandia, Los Alamos, Argonne, and Lawrence Berkeley.
Preparations are coordinated by working groups on system software, compilers, libraries, I/O subsystems, system assurance, messaging, and performance and correctness tools. LLNL’s El Capitan Center of Excellence (CoE) involves applications teams from the “tri-labs”—LLNL, LANL, and SNL—as part of the NNSA’s Advanced Simulation and Computing (ASC) Program.
According to David Richards, LLNL’s CoE lead, “The CoE’s goal is to help ASC developers at LLNL, LANL, and SNL prepare their codes to run efficiently and effectively on day one when El Capitan is delivered. The CoE team consists of about a dozen experts from HPE and AMD who are helping to optimize lab codes and improve the HPE/AMD software environment, including compilers, debuggers, and other developer tools.”
LLNL recently installed the early access system RZNevada for CoE application development. According to Richards, “RZNevada is also being used to test TOSS 4 (the latest version of the Tri-Lab Operating System Stack, first developed at LLNL) and a new Linux user environment. Both are evolving rapidly. The CoE also has access to Redwood, a similarly architected platform that is hosted at HPE. Redwood will soon be joined by Cedar.”
Closer to home, de Supinski credits LC’s staff for the effort, including key contributors in its Advanced Technologies Office. “My whole team and all of LC are busy working to be ready for this system, and we’ve got engagements throughout the Lab.”