
Fault Tolerance in High Performance Computing Reading Group

Welcome to the Reading Group in Fault Tolerance for High Performance Computing. This space has a few goals:

  • Discuss the latest and most influential papers on Fault Tolerance in High Performance Computing.
  • Leverage the collaboration between different research groups working on this topic.
  • Provide feedback for those currently developing systems and algorithms to make supercomputing more resilient.

The reading group meets every other Wednesday at 4:00pm in room 3102 (Siebel Center).

Quick Info

Where? Siebel Center 3102
When? Wednesday 4:00 - 5:00 pm
Next meeting: Wednesday, May 25


Candidate Papers for Future Discussions

Title | Authors | Link
Building Fault Survivable MPI Programs with FT-MPI Using Diskless Checkpointing | Z. Chen, G.E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, J. Dongarra |
A large-scale study of failures in high-performance computing systems | B. Schroeder, G.A. Gibson | PDF
Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs | Nithin Nakka, Ankit Agrawal, Alok Choudhary |


If you are in charge of presenting a paper for the next session, please follow these steps:

  1. Prepare a short presentation (5-7 slides) about the paper. Summarize the main ideas and highlight the strong and weak points. Pay attention to the clarity of the description, the contribution, the quality of the experiments, and the impact on future work.
  2. Go to the "Sessions" page by clicking the link at the bottom of this page.
  3. Create a page named after the date of the session.
  4. Copy the template from the page "Meeting Page Template" (linked at the bottom of this page).
  5. Fill in all the fields of the template.
  6. Add a link to the page below the title "Sessions" on this page.


Most of the templates for the pages on this website were extracted from the "Parallel Reading Group" website.
