This article is part of a series about Livermore Computing’s efforts to stand up the NNSA’s first exascale supercomputer. El Capitan will come online in 2024 with the processing power of more than 2 exaflops, or 2 quintillion (10¹⁸) calculations per second. The system will be used for predictive modeling and simulation in support of the stockpile stewardship program.
Eavesdrop on enough conversations at the Livermore Computing Center and you’ll repeatedly hear what it takes to “get application codes up and running on El Capitan.” Just as lumber and nails alone don’t make a house, high performance computing (HPC) hardware components require specialized software tools—and experts to wield them—in order to operate. Led by Matthew LeGendre, El Capitan’s Tools Working Group focuses on correctness, debugging, and performance analysis tools.
Correctness software executes rigorous testing programs to confirm an application is running as intended, whereas debugging tools detect and fix errors that can occur even when an application has run many times on the same hardware. LLNL teams use a range of performance analysis tools to monitor an HPC system’s power management, throughput, resource utilization, and other metrics.
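As a loose illustration of that division of labor (not drawn from any specific El Capitan tool, and using a hypothetical routine), the C++ sketch below pairs a correctness check, which compares a result against a known reference within a tolerance, with a simple timing measurement of the same work, the kind of metric performance tools gather automatically across an entire code.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical routine standing in for a real application kernel.
double sum_of_squares(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v * v;
    return s;
}

int main() {
    std::vector<double> data(10'000'000, 0.5);

    // Correctness check: compare the computed result against an analytically
    // known reference value, within a floating-point tolerance.
    const double expected = 0.25 * data.size();
    const double result = sum_of_squares(data);
    assert(std::fabs(result - expected) < 1e-6 * expected);

    // Performance measurement: time the same routine.
    auto t0 = std::chrono::steady_clock::now();
    volatile double r = sum_of_squares(data);
    auto t1 = std::chrono::steady_clock::now();
    (void)r;
    std::printf("elapsed: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}
```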
The Tools Working Group’s mandate boils down to efficiency and accuracy. “We’re concentrating on your code once it’s running,” explains computer scientist Olga Pearce. “If it’s crashing or not producing correct answers, we use debugging and correctness tools. And when your code is working the way it should, and producing the correct answers, then you need performance tools to measure and monitor how fast it’s running.”
As with most aspects of the transition from petascale to exascale, and from CPUs to GPUs, these types of tools have undergone significant scale-up development. “At this point in the Sierra integration [2017–2018], we were concerned about the quality of debuggers,” recalls LeGendre. “In the intervening years, industry and the HPC community have realized the need to debug code for GPUs. Debuggers are working well at this phase of siting El Capitan.”
With more computing power at stake, the group continues to look closely at correctness and performance tools while sharing their expertise with El Capitan’s chip manufacturer, Advanced Micro Devices Inc. (AMD). The supercomputer will have AMD’s new integrated CPU/GPU processors, called APUs. Fittingly, the A in APU stands for accelerated. These processors will account for all of the machine’s peak speed, providing enormous floating-point performance alongside increased fidelity and repeatability of calculations.
LeGendre notes, “Part of this collaboration is knowledge transfer. AMD is learning how to build performance tools, while we’re learning how to use their processors.” The Tools Working Group also leverages open-source software like HPCToolkit as well as contributions from El Capitan’s interconnect and storage partner, Hewlett Packard Enterprise (HPE).
“We are a pretty demanding customer,” Pearce adds. “Our application teams want to understand the performance of their codes to a much finer degree than the other customers these vendors may have dealt with in the past.”
User support is crucial in this new ecosystem. Application teams accustomed to Sierra’s NVIDIA GPUs will need to learn how their codes work on El Capitan’s AMD APUs. “Users will have different interfaces and workflows, but they’ll find some familiarity within what these tools are doing,” LeGendre says.
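For instance, a code that launches CUDA kernels on Sierra’s NVIDIA GPUs would typically move to AMD’s HIP runtime on El Capitan, where the calls map almost one to one. The sketch below uses a hypothetical kernel, not an actual application code, to show that familiarity.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical kernel: scales a vector in place on the GPU.
__global__ void scale(double* x, double a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    std::vector<double> host(n, 1.0);
    double* dev = nullptr;

    // The HIP calls below mirror their CUDA counterparts (cudaMalloc,
    // cudaMemcpy, ...), which eases porting of Sierra-era GPU codes.
    hipMalloc(reinterpret_cast<void**>(&dev), n * sizeof(double));
    hipMemcpy(dev, host.data(), n * sizeof(double), hipMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(dev, 2.0, n);
    hipDeviceSynchronize();

    hipMemcpy(host.data(), dev, n * sizeof(double), hipMemcpyDeviceToHost);
    hipFree(dev);

    std::printf("host[0] = %.1f\n", host[0]);  // expect 2.0
    return 0;
}
```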
Both Pearce and LeGendre refer to El Capitan’s simulation capacity as “a game changer.” As the NNSA ramps up stockpile modernization efforts, the importance of investing in an exascale supercomputer remains front of mind for the teams working to bring the system online. Pearce states, “Scientists will be hard at work using El Capitan to respond to the Lab’s mission, and it will be rewarding to see how all of our efforts contribute to that.”
Previous: Packaging for everyone | Next: Messaging and math
—Holly Auten & Meg Epperly