BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20190719T085743Z
LOCATION:HG E 3
DTSTART;TZID=Europe/Stockholm:20190612T153000
DTEND;TZID=Europe/Stockholm:20190612T160000
UID:submissions.pasc-conference.org_PASC19_sess154_msa316@linklings.com
SUMMARY:FA-MPI: Using a Parallel Transactional Model of Fault Tolerance an
 d Statistical Consensus for the Message Passing Interface
DESCRIPTION:Minisymposium\nComputer Science and Applied Mathematics, Emerg
 ing Application Domains, Chemistry and Materials, Climate and Weather, Phy
 sics, Solid Earth Dynamics, Life Sciences, Engineering\n\nFA-MPI: Using a 
 Parallel Transactional Model of Fault Tolerance and Statistical Consensus 
 for the Message Passing Interface\n\nSkjellum, Bangalore, Schafer, Hassani
 , Ghafoor\n\nWe overview FA-MPI, which differs from leading efforts like U
 LFM; we address the "non-blocking" functionality of MPI (e.g., MPI_Ibcast)
 . Non-blocking MPI-4 comprises a comprehensive subset of MPI; all blocking
  functions can be layered. We assert that exascale codes will emphasize no
 n-blocking anyway. We don't attempt 100% API coverage presently; only
  message-passing APIs are addressed. We introduce a collective TRY-CATCH p
 arallel containment section, in which MPI, the application, and external a
 gents can coalesce failure information. TRY-CATCH supports lexical delimit
 ers at the end of which group-wide consensus can be established. These
  provide a mechanism for fault injection during testing. We treat the o
 utcomes of TRY-CATCH blocks as "consensus of success," "consensus of N fa
 ilures," or "inability to reach consensus." The last appeals to hierarch
 ical recovery (e.g., fail backward with CPR). Successful consensus can be 
 used for applications to build alternatives to CPR, such as fail forward, 
 message logging, or selective recomputation and recommunication. Fault mod
 els are not limited to process failure. The FA-MPI prototype is described.
  Our approach to fault detection is discussed. Our newest efforts to expl
 ore network faults and to minimize fault-free overhead are also mentioned
 . We outline our next effort, merging FA-MPI with ExaMPI/Stages, to enabl
 e multiple fault models in one MPI product.
END:VEVENT
END:VCALENDAR

