Road to El Capitan 10: Collaboration is key

This article is part of a series about Livermore Computing’s efforts to stand up the NNSA’s first exascale supercomputer. El Capitan will come online in 2024 with the processing power of more than 2 exaflops, or 2 quintillion (10¹⁸) calculations per second. The system will be used for predictive modeling and simulation in support of the stockpile stewardship program.

Previous: Messaging and math

Under the moniker Tri-Labs, high performance computing experts at Lawrence Livermore, Los Alamos, and Sandia National Laboratories work closely on initiatives for the NNSA Advanced Simulation and Computing (ASC) program. For example, this team develops and deploys the custom Tri-Lab Operating System Stack (see the article The right operating system in this series). Another area of collaboration is the Center of Excellence (CoE), which extends to El Capitan’s hardware partners Hewlett Packard Enterprise (HPE) and Advanced Micro Devices Inc. (AMD).

The CoE is more than an administrative entity. Whereas working groups may include additional organizations with a stake in El Capitan’s success, the CoE facilitates communication of requirements and pain points. “Our number one goal is that on ‘Day Zero,’ when this amazing system comes online, our programmatic applications from all three labs are ready to make efficient and productive use of it,” says Judy Hill, Livermore’s CoE lead. Anna Pietarila Graham from Los Alamos and James Elliott from Sandia round out the Tri-Lab CoE leads.

Vendor and CoE staff work side by side with Tri-Lab teams. One key activity is the recurring hackathon, which brings together developers and hardware experts from each participating organization to fine-tune applications on El Capitan’s early access systems. The CoE leads emphasize the value of hackathon interactions. “A core group of code teams attend all the New Mexico hackathons [at Sandia], and some travel to Livermore because they find it really useful to have not just the vendors there, but to have three days of dedicated time to work on their codes without distractions,” notes Pietarila Graham.

Between hackathons, the CoE conveys bugs and feature requests to HPE and AMD so that the overall software stack matures, scales, and stabilizes. “We help the vendors understand our challenges. From their point of view, our codes are massive and complex and written over many years, and their standard customers aren’t doing the kind of science we are,” Elliott explains. “We run into edge-case problems that vendors wouldn’t normally encounter—the kinds of problems that arise when you push the limits. I compare it to race car drivers versus regular drivers. The problems are of a different type and scale.”

This effort is a win–win: as these unique issues are resolved, Tri-Lab expertise helps the vendors improve their own R&D. For instance, application correctness and performance problems reported through the CoE have led to vendor-released software updates that fix them. Both vendors acknowledge the mutual benefit.

“Our close collaboration with LLNL and AMD has been absolutely vital in designing and deploying a transformative system like El Capitan,” states Gerald Kleyn, vice president, Customer Solutions, HPC & AI, HPE. “Our work together has spanned multiple years and has involved partnering on overall design, updating the LLNL facility, and ensuring application readiness through the CoE. This has enabled HPE to share our expertise and, in turn, learn from other world-class experts across a range of fields as we test new features and gain insights into priority applications that will secure better outcomes for our customers.”

Nick Malaya, an AMD Fellow in High Performance Computing, adds, “The collaboration with Livermore Computing and the Tri-Labs has been instrumental in ensuring El Capitan’s hardware and software readiness. The CoE has assembled some of the world’s foremost experts in computer architecture, computer science, and scientific computing. The team has been able to anticipate and address complex challenges in software porting and tuning, while also pushing the boundaries of what’s possible in HPC.”

Even with cutting-edge software and hardware, El Capitan’s productivity will hinge on robust user support. Pietarila Graham states, “A big part of the CoE’s responsibilities is advocating for all future users—not only making sure the applications will run, but that they’ll run for regular users in addition to experienced code developers.”

When the NNSA’s first exascale supercomputer accepts its first user login and fires up its first application, the Tri-Lab collaboration will continue as the labs tackle their shared national security mission. Hill says, “I’m excited for those moments when El Capitan will surpass our expectations.”

—Holly Auten & Meg Epperly