Session

Minisymposium: MS13 - Programming Models to Enable Scalable Resilience for Extreme Scale Computing Systems
Event Type: Minisymposium
Scientific Fields
Computer Science and Applied Mathematics
Emerging Application Domains
Chemistry and Materials
Climate and Weather
Physics
Solid Earth Dynamics
Life Sciences
Engineering
Time: Wednesday, 12 June 2019, 15:30 - 17:30
Location: HG E 3
Description: With the growing scale and complexity of computational systems, HPC applications are increasingly susceptible to a wide variety of hardware and software faults, making failure mitigation at the runtime and application layers more essential. Resilience has become a first-class concern for the productive use of extreme-scale HPC systems. Today, the dominant application-level resilience scheme is coordinated checkpoint and restart (C/R), which involves global coordination of processes and threads. Despite recent progress in I/O technology and the emergence of efficient C/R techniques, this global recovery model has inherent scalability issues, given that the majority of failures affect a single process or node (local failures). Recently, several alternative approaches have been proposed to enable localized responses to local failures, but their feasibility has yet to be studied. In this minisymposium, we will discuss recent progress in runtime and library approaches to extreme-scale resilience, including state-of-the-art C/R, the new fault tolerance proposal for MPI, and a localized recovery model facilitated by emerging asynchronous many-task parallel programming models.