Resilience has been identified as a top challenge in building and operating exascale supercomputers. Soft and hard errors will occur much more frequently in exascale systems than in today’s petascale machines because such systems will contain far greater numbers of microprocessors, memory devices, and other hardware components.
Traditional fault tolerance approaches, such as checkpoint/restart, may not work at exascale for every scientific application, because the time to checkpoint and restart may exceed the mean time to failure of the fastest supercomputers. These approaches are fundamentally inefficient because they operate below the application level: working at the hardware and operating-system levels, they save far more data than the application actually needs to recover from a failure.
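A rough back-of-the-envelope illustration (not from the original analysis): by Young's classic approximation, if \delta is the time to write one checkpoint and M is the mean time between failures, the checkpoint interval that minimizes lost work is

    \tau_{\text{opt}} \approx \sqrt{2\,\delta M},

and the minimum fraction of machine time lost to checkpointing and recomputation scales as \sqrt{2\delta/M}. As checkpoint time grows toward the mean time to failure, as projected for some exascale scenarios, that overhead consumes essentially the entire run.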
Looking forward to exascale, application-level resilience is emerging as a promising alternative. It provides fault tolerance at a lower cost than traditional approaches by saving only the minimum data required for the application to recover from failures. A form of algorithm-based fault tolerance, it stores extra information, either locally or remotely, that is then used in the event of a failure to reconstruct or recover lost data.
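As a minimal sketch of what "saving only what is needed" can look like in practice (illustrative C, not ddcMD's actual code), an application-level checkpoint might keep just the particle positions, velocities, and step count, rather than a full process-memory image:

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        long    step;   /* timestep the snapshot corresponds to */
        size_t  n;      /* number of local particles            */
        double *pos;    /* 3*n positions                        */
        double *vel;    /* 3*n velocities                       */
    } Snapshot;         /* zero-initialize before first use     */

    /* Copy the minimal recovery state into an in-memory snapshot. */
    void snapshot_save(Snapshot *s, long step, size_t n,
                       const double *pos, const double *vel)
    {
        s->step = step;
        s->n    = n;
        s->pos  = realloc(s->pos, 3 * n * sizeof *s->pos);
        s->vel  = realloc(s->vel, 3 * n * sizeof *s->vel);
        memcpy(s->pos, pos, 3 * n * sizeof *pos);
        memcpy(s->vel, vel, 3 * n * sizeof *vel);
    }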
LLNL researchers are implementing application-level resilience in ddcMD, an LLNL-developed large-scale molecular dynamics application. ddcMD includes fault-tolerance functionality that allows it to roll back to a previous state, provided the system can detect a fault and notify the application. This mechanism has been used effectively to recover from transient L1 cache faults on the BlueGene/L machine, allowing scientists to perform large-scale molecular dynamics simulations even in the presence of frequent errors.
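A hedged sketch of that rollback mechanism, reusing the Snapshot type above (fault_detected() and advance_one_step() are assumed hooks standing in for the real fault-notification and integration routines, not actual ddcMD or BlueGene/L APIs):

    #include <string.h>

    extern int  fault_detected(void);   /* assumed system notification hook */
    extern void advance_one_step(double *pos, double *vel, size_t n);

    void run(Snapshot *snap, double *pos, double *vel, size_t n, long nsteps)
    {
        for (long step = 0; step < nsteps; ++step) {
            if (step % 100 == 0)                 /* snapshot cadence        */
                snapshot_save(snap, step, n, pos, vel);

            advance_one_step(pos, vel, n);       /* integrate one timestep  */

            if (fault_detected()) {
                /* Roll back: restore the last good state ...               */
                memcpy(pos, snap->pos, 3 * n * sizeof *pos);
                memcpy(vel, snap->vel, 3 * n * sizeof *vel);
                step = snap->step - 1;  /* ... and redo from that snapshot  */
            }
        }
    }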
We are extending ddcMD to recover from process and node failures, since such failures will be commonplace at exascale. With application-level resilience, ddcMD can continue executing through most errors without restarting the entire application, as the checkpoint/restart approach requires.
The key idea is that each node's critical state is replicated on neighbor nodes in rounds. When a node fails (for example, it crashes), ddcMD moves forward with the remaining healthy nodes. Application-level resilience then lets ddcMD reincorporate the lost data into its workload by retrieving it from the node where it was saved during the previous round.
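A minimal sketch of one replication round, assuming a simple ring scheme over MPI (not necessarily ddcMD's exact protocol): each rank sends its critical state to its right neighbor and keeps the copy arriving from its left neighbor, so after every round each rank's state survives on one other node.

    #include <mpi.h>

    void replicate_round(const double *my_state, double *neighbor_copy,
                         int count, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int right = (rank + 1) % size;         /* rank that stores my copy  */
        int left  = (rank - 1 + size) % size;  /* rank whose copy I store   */

        /* Exchange state with both ring neighbors in one combined call. */
        MPI_Sendrecv(my_state,      count, MPI_DOUBLE, right, 0,
                     neighbor_copy, count, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
    }

Note that actually surviving the loss of a rank also requires a runtime that can repair or shrink the communicator (for example, a fault-tolerant MPI such as ULFM); standard MPI aborts the whole job when a process dies.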
We are exploring ways to apply this technique to other scientific applications, and, using a fault injection test bed with large-scale runs, we are working to maintain performance even under high fault rates.
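On the test-bed side, a fault injector can be as simple as having a designated victim rank kill itself at a chosen step (a toy harness for illustration, not the actual LLNL test bed; under standard MPI this aborts the job, so it is meaningful only with a fault-tolerant runtime):

    #include <mpi.h>
    #include <signal.h>

    /* Simulate a sudden node crash for resilience testing. */
    void maybe_inject_failure(long step, long fail_step,
                              int victim_rank, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        if (step == fail_step && rank == victim_rank)
            raise(SIGKILL);    /* process dies without any cleanup */
    }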