Session – PASC Program


Minisymposium: MS13 - Programming Models to Enable Scalable Resilience for Extreme Scale Computing Systems

Session Chairs

Hemanth Kolla

Sandia National Laboratories

Aurelien Bouteiller

University of Tennessee

Keita Teranishi

Sandia National Laboratories

Event Type: Minisymposium

Scientific Fields

Time: Wednesday, 12 June 2019, 15:30 - 17:30

Location: HG E 3

Description: With the growing scale and complexity of computational systems, HPC applications are increasingly susceptible to a wide variety of hardware and software faults, making failure mitigation at the runtime and application layers more essential. Resilience has become a first-class concern for productive use of extreme-scale HPC systems. Today, the dominant application-level resilience scheme is coordinated checkpoint and restart (C/R), which involves global coordination of processes and threads. Despite recent progress in I/O technology and the emergence of efficient C/R techniques, this global recovery model has inherent scalability issues, given that the majority of failures affect a single process or node (local failures). Recently, several alternative approaches have been proposed to enable localized responses to local failures, but their feasibility is yet to be studied. In this minisymposium, we will discuss recent progress in runtime and library approaches for extreme-scale resilience, including state-of-the-art C/R, new fault tolerance proposals for MPI, and localized recovery models facilitated by emerging asynchronous many-task parallel programming models.

Presentations

15:30 - 16:00	FA-MPI: Using a Parallel Transactional Model of Fault Tolerance and Statistical Consensus for the Message Passing Interface	Authors: Anthony Skjellum, Purushotham Bangalore, Derek Schafer, Amin Hassani, Sheikh Ghafoor	Computer Science and Applied Mathematics
16:00 - 16:30	Programming Model Design Tradeoffs of Global vs. Local Recovery for Algorithm Based Fault Tolerance	Authors: Hemanth Kolla, Keita Teranishi, Jackson Mayo, Maher Salloum, Rob Armstrong	Computer Science and Applied Mathematics
16:30 - 17:00	VeloC: Very Low Overhead Checkpointing System	Author: Bogdan Nicolae	Computer Science and Applied Mathematics
17:00 - 17:30	Open Discussion on Programming Models to Enable Scalable Resilience	Author: Keita Teranishi	Computer Science and Applied Mathematics
