To meet the ever-rising demand for compute cycles, high-performance computing (HPC) centers have long counted on Moore’s Law, which holds that the number of transistors in an integrated circuit doubles roughly every two years. We are rapidly entering a post-Moore’s-Law age, especially where compute cycles per dollar are concerned. Meanwhile, “Livermore Computing (LC) faces increasing demand for compute cycles to meet the Lab’s mission-critical simulation needs,” according to LC Chief Technology Officer Bronis R. de Supinski.
With Sierra’s heterogeneous node architecture, which pairs central processing units (CPUs) with graphics processing units (GPUs), LC began to address the question of how to accomplish more cost-effective science. Still, “Sierra’s architecture is mostly general purpose while artificial intelligence (AI) accelerators are specialized and therefore very efficient,” says de Supinski.
Toward Fully Heterogeneous Computing
LC’s Advanced Technology Office, which de Supinski leads, sited two different AI accelerators at LLNL in 2020. The first was the Cerebras wafer-scale AI engine, attached to Lassen, Sierra’s smaller, unclassified companion system. Soon after, the team integrated an AI accelerator from SambaNova Systems into the Corona supercomputing cluster. LLNL researchers are using both of these heterogeneous system architectures to explore the combination of HPC and AI.
“Our strategy is already demonstrating that this approach will provide more cost-efficient solutions for the workloads of the future,” says de Supinski.
AI is computationally intensive, making it well suited to fully heterogeneous systems. “We can offload certain computations to AI or other accelerators while the regular HPC computations can carry on at the same time,” adds former principal HPC strategist Ian Karlin.
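As a rough illustration of that overlap, the sketch below launches a hypothetical accelerator inference call on a worker thread while a simulation step continues on the host. Both functions are stand-ins invented for illustration; neither reflects an actual LLNL code or accelerator API.

```python
# Illustrative only: overlap a (hypothetical) AI-accelerator inference
# with host-side HPC work. Neither function reflects a real accelerator API.
from concurrent.futures import ThreadPoolExecutor
import time

def advance_simulation(state):
    """Stand-in for a CPU/GPU physics step that stays on the compute nodes."""
    time.sleep(0.1)          # pretend to do work
    return state + 1

def accelerator_inference(features):
    """Stand-in for a call into an AI accelerator's inference service."""
    time.sleep(0.1)          # pretend the accelerator is busy
    return sum(features) / len(features)

def timestep(state, features):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Hand the AI work to the accelerator...
        pending = pool.submit(accelerator_inference, features)
        # ...while the regular HPC computation carries on at the same time.
        state = advance_simulation(state)
        # Fold the surrogate's prediction back in once it is ready.
        return state, pending.result()

print(timestep(0, [1.0, 2.0, 3.0]))
```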
Additionally, AI accelerators such as the Cerebras and SambaNova systems are faster and more efficient at the kinds of computations that arise in LLNL’s complex scientific problems, such as mesh management and inertial confinement fusion modeling. The integrations will accelerate LC workflows, improving accuracy and user productivity alike.
Solving the Stranded Resource Problem
On traditional supercomputers, different parts of a single run of a program, called a job, have different computational needs. This mismatch can leave resources on the system “stranded,” or underutilized, especially during long or complex jobs.
Karlin explains, “Different LLNL workflows have different ratios of needs, so being able to mix and match on a per-job basis adds flexibility. So, we want accelerators available on the network, rather than per node, to address the stranded resource problem.”
This work is a major step toward the end goal of production usage: On a system with an AI accelerator, a developer would submit a complex HPC job to resource management software such as Flux. The job would contain a set of parameters designating a portion of the work as suitable for AI acceleration. The HPC portion of the job would run in the traditional way on the CPUs and GPUs while the AI element would be sent to the accelerator for specialized treatment. Results, or even improved parameters, would flow from the AI accelerator back to the main job. This information could then help refine the parameters for future runs, allowing each piece of the heterogeneous system to run separately, but in concert, for a more efficient job.
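The interfaces for this kind of split scheduling are still taking shape, but the sketch below illustrates the general idea in Python, assuming Flux’s Python bindings (flux.job). The simulation and accelerator-side commands, paths, and resource counts are hypothetical placeholders, not LLNL’s production workflow.

```python
# Hypothetical sketch: one workflow split into an HPC piece (CPUs/GPUs)
# and an AI piece destined for a network-attached accelerator, both
# submitted through Flux. Commands, paths, and resource counts are
# placeholders; only the Flux API calls reflect the real bindings.
import os
import flux
import flux.job
from flux.job import JobspecV1

handle = flux.Flux()  # connect to the enclosing Flux instance

# Traditional HPC portion: runs on compute nodes in the usual way.
hpc = JobspecV1.from_command(
    command=["./hydro_sim", "--input", "problem.yaml"],  # hypothetical app
    num_nodes=16,
    num_tasks=64,
    gpus_per_task=1,
)
hpc.cwd = os.getcwd()
hpc.environment = dict(os.environ)
hpc_id = flux.job.submit(handle, hpc, waitable=True)

# AI portion: a launcher script (hypothetical) that hands the surrogate
# model to the accelerator's runtime instead of the compute nodes.
ai = JobspecV1.from_command(
    command=["./offload_to_accelerator.sh", "surrogate.ckpt"],
    num_nodes=1,
    num_tasks=1,
)
ai.cwd = os.getcwd()
ai.environment = dict(os.environ)
ai_id = flux.job.submit(handle, ai, waitable=True)

# Wait for both pieces; the surrogate's results (or improved parameters)
# could then feed the next iteration of the HPC job.
flux.job.wait(handle, hpc_id)
flux.job.wait(handle, ai_id)
```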
As de Supinski puts it, “Instead of computing every step along the way, you are jumping along the timeline. A single run might take longer, but the total cost is decreased because we get better results from the approximations that these technologies allow us to make. The total time to solution is decreased by working smarter, not harder.”
Beating Moore’s Law
Both the Cerebras and SambaNova projects employ a “Center of Excellence” model that joins industry’s hardware, low-level software, and machine-learning expertise with LLNL’s expertise in applications, system design, and AI and other areas of data science. Initial results with both solutions show roughly a fivefold performance improvement per transistor over CPUs and GPUs alone.
Looking a little further into the future, El Capitan will add heterogeneity relative to Sierra. HPE’s near-node storage solution, called Rabbit, was a key factor in the vendor selection; it will be deployed throughout the system, with one Rabbit for every compute chassis. Rabbits enable more efficient defensive input/output (I/O), such as checkpointing, and reduce system interference, which is especially important for the complex workflows that will continue to advance LLNL computational science.
The Cutting Edge
LLNL was the first national lab to integrate Cerebras’s engine into one of the world’s top 20 fastest supercomputers (Lassen) and, in doing so, can test the value of these new technologies for combined AI and HPC workflows at large scale.
“The closest anyone else gets to this level of full heterogeneity is cloud providers,” concludes de Supinski, “but their systems do not have the low latency required for our workloads. So, we are pushing out and they are pushing in, and we’ll meet in the middle. The cutting-edge technology that LLNL and our sponsors are investing in is shaping the future. Someday, technologies such as AI accelerators and near-node per-chassis storage will be standard, but for now, making these innovations a reality is the focus of our biggest systems.”