Author: Allan S. Nielsen, École polytechnique fédérale de Lausanne (EPFL)

In the quest towards reaching Exascale computational throughput on future Exascale-capable machines, dealing with faulty hardware is expected to be a major challenge. On today's Petascale systems, hardware failures of various forms have already become something of a daily occurrence.

On a large cluster, the failure of a single node, or of a power supply unit feeding a small set of nodes, might not seem like a big deal given that the machine may comprise several thousand nodes. Consider, however, that most codes for computational science and engineering use MPI for passing messages between nodes, and in the current MPI world the failure of a single rank brings down the entire application, thereby halting the computational progress of all other nodes in the job, including the ones that were running just fine.

This in turn means that the application scientist occasionally needs to rerun an entire job due to the failure of a single node. And as if this weren't problematic enough, the probability of having to do such a rerun increases with the size of the job. The approach widely used to mitigate the waste of compute cycles due to component failures is to have applications checkpoint all data needed for a restart at regular intervals. In the event of a failure, the application may then simply be restarted from the most recent checkpoint rather than recomputing everything. This approach has worked well for a long time, and a simple higher-order formula exists for estimating the optimal checkpoint interval in such a setup [Dal06]:

τ_opt = √(2δM) · [ 1 + (1/3)·√(δ/(2M)) + (1/9)·(δ/(2M)) ] - δ,   valid for δ < 2M
In the above expression, δ denotes the time it takes to create a checkpoint and M the mean time between failures. This fairly simple approach is already reaching a scaling limit, though. For large jobs on large machines, it is not uncommon for applications to spend 10-20 percent of their time creating checkpoints on the parallel file system. As the number of nodes involved in a job grows, even if we assume δ to be constant, the fraction of time spent checkpointing relative to time spent computing will increase because the mean time between failures M shrinks. In addition, the expected number of failure-restarts a job experiences will also grow with machine size, which further decreases the overall efficiency of the resources used. For Exascale extrapolations of current Petascale systems, it has been suggested that things will get far worse, to the extent that applications won't even be able to make progress because they are in a constant state of checkpointing to, or restarting from, the parallel file system.
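As a rough back-of-the-envelope illustration, the short C program below evaluates Daly's estimate for a hypothetical 5-minute checkpoint on systems with a 24-hour and a 2-hour mean time between failures; the helper name and all the numbers are made up purely for this example.

#include <math.h>
#include <stdio.h>

/* Daly's higher-order estimate of the optimal checkpoint interval.
 * delta : time needed to write one checkpoint (seconds)
 * M     : mean time between failures (seconds)
 * Valid for delta < 2M; otherwise the optimum is simply M.             */
static double daly_optimal_interval(double delta, double M)
{
    if (delta >= 2.0 * M)
        return M;
    double x = delta / (2.0 * M);
    return sqrt(2.0 * delta * M) * (1.0 + sqrt(x) / 3.0 + x / 9.0) - delta;
}

int main(void)
{
    double delta  = 300.0;                            /* 5 minute checkpoint */
    double mtbf[] = { 24.0 * 3600.0, 2.0 * 3600.0 };  /* 24 h and 2 h MTBF   */
    for (int i = 0; i < 2; i++)
        printf("MTBF %7.0f s -> checkpoint every %6.0f s\n",
               mtbf[i], daly_optimal_interval(delta, mtbf[i]));
    return 0;
}

For these made-up numbers, shrinking the mean time between failures from a day to two hours cuts the optimal checkpoint interval by almost a factor of four, from roughly two hours to roughly half an hour, which is precisely the effect driving the scaling concerns above.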

Efforts have been made to develop ideas that improve upon the basic checkpoint-restart approach. Statistics collected on general-purpose compute clusters indicate that the majority of failures, around 80%, involve only a single node, and that when multiple nodes do fail together, they often do so in a predictable pattern, e.g., through the failure of a shared resource. A potentially scalable extension of checkpoint-restart is therefore to introduce cheaper, less resilient checkpoints that can recover from the loss of a small subset of nodes. Such lightweight checkpoints can be taken frequently and used in combination with the more expensive and more resilient conventional checkpoints written to the parallel file system. In this scenario, the conventional parallel file system checkpoints need not be taken as frequently since, statistically, most failures can be recovered from using the protection that the lightweight checkpoints offer [MBMS10].
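A minimal sketch of how such a two-level scheme might be driven from an application's time loop is given below; everything in it, the function names, the intervals and the stub bodies, is an illustrative placeholder rather than the interface of any particular library.

#include <stddef.h>
#include <stdio.h>

/* Illustrative stubs: in a real application these would be the solver and
 * the two checkpointing back-ends (an in-memory library and the PFS).     */
static void advance_solution(double *s, size_t n)       { (void)s; (void)n; }
static void lightweight_checkpoint(double *s, size_t n) { (void)s; (void)n; puts("light checkpoint"); }
static void pfs_checkpoint(double *s, size_t n)         { (void)s; (void)n; puts("PFS checkpoint"); }

int main(void)
{
    const int light_every = 50;   /* cheap, frequent, tolerates loss of a few nodes  */
    const int pfs_every   = 1000; /* expensive, rare, tolerates loss of whole groups */
    double state[1024] = { 0.0 };

    for (int step = 1; step <= 5000; step++) {
        advance_solution(state, 1024);               /* application work              */
        if (step % pfs_every == 0)
            pfs_checkpoint(state, 1024);             /* level 2: parallel file system */
        else if (step % light_every == 0)
            lightweight_checkpoint(state, 1024);     /* level 1: in-memory            */
    }
    return 0;
}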

As part of the ExaFlow project, we are developing a library for the creation of lightweight in-memory checkpoints. Checkpoints are constructed using Reed-Solomon erasure codes across groups of nodes of a size specified by the user. Within a group, as many nodes as the checkpoint has checksums may be lost without compromising the data protected by that checkpoint. The data to be protected is stored locally in memory, and the generated checksum codes are stored, also in memory, distributed across all nodes within the group. Distributing the codes is achieved by splitting the data to be protected and encoded into smaller chunks, or "stripes". The memory overhead for storing the checksum codes scales as

m/n
where n is the number of nodes in a checkpoint group and m the number of checksums. Encoding and decoding are performed in parallel across the nodes of a group using MPI, together with functionality from the C library GF-Complete to perform the Galois field arithmetic involved [PG14]. The parallel encoding overhead is proportional to

 

whilst the decoding overhead scales as

Figure 1: Time to create a checkpoint of a 16 GB distributed array as a function of the number of MPI ranks across which the array is distributed.

The approach to protecting the data is essentially the same as in RAID 5 or RAID 6, except that with pcheckpoint one may create lightweight checkpoints with arbitrary group sizes and any number of checksums, and, crucially, everything is kept in memory. Figure 1 contains preliminary measurements of the scaling of pcheckpoint on the EPFL Fidis cluster: the time to encode a 16 GB distributed array is measured as a function of the number of cores. With 2000 cores, pcheckpoint is able to encode, distribute and protect data at a rate of more than 100 GB/s.
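To make the data-protection idea concrete, the sketch below shows the simplest possible instance of such a scheme: a single XOR checksum over a group of ranks, i.e. the m = 1, RAID-5-like special case, computed with a plain MPI reduction. pcheckpoint itself generalises this to m Reed-Solomon checksums computed with GF-Complete and striped across the group, so this code is only an illustrative stand-in, not the library's implementation.

#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Single-checksum stand-in: every rank contributes its local block and the
 * 'keeper' rank ends up holding the bitwise XOR of all blocks.  If any one
 * rank later loses its data, the lost block can be rebuilt by XOR-ing the
 * checksum with the surviving blocks -- the m = 1 case of an erasure code. */
static void xor_checksum(const uint8_t *block, uint8_t *checksum,
                         size_t nbytes, int keeper, MPI_Comm group)
{
    /* MPI_BXOR is a predefined reduction operation, valid on byte data. */
    MPI_Reduce(block, checksum, (int)nbytes, MPI_BYTE, MPI_BXOR, keeper, group);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t nbytes = (size_t)1 << 20;              /* 1 MiB per rank, for example */
    uint8_t *block    = malloc(nbytes);
    uint8_t *checksum = malloc(nbytes);
    memset(block, (uint8_t)(rank + 1), nbytes);   /* dummy local data            */

    xor_checksum(block, checksum, nbytes, 0, MPI_COMM_WORLD);

    free(block);
    free(checksum);
    MPI_Finalize();
    return 0;
}

Replacing the XOR with m independent Galois-field linear combinations of the blocks, as pcheckpoint does, is what allows a group to survive the loss of up to m nodes instead of just one.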

We envision that pcheckpoint may be used to create a true multi-level hierarchical set of checkpoints for use on Exascale machines. We are currently working on integrating support for ULFM MPI into pcheckpoint so that the library not only supports in-memory protection of data, but also automatic recovery and restart. Once this has been completed, we will document its performance and use with Nektar++ and release the code for the community to use.
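As a very rough sketch of what ULFM-based recovery tends to look like (using the MPIX_ calls from the ULFM prototype; the header name, the error-handling details and the commented-out recovery hook are assumptions on our part, not pcheckpoint's actual code), the surviving ranks typically revoke the damaged communicator, shrink it to exclude the failed processes, and then rebuild the lost blocks from the erasure-coded checkpoint before resuming:

#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_Comm_revoke / MPIX_Comm_shrink in the ULFM prototype */

/* Sketch of the revoke/shrink/restore pattern.  It assumes the communicator's
 * error handler has been set to MPI_ERRORS_RETURN, so that a process failure
 * surfaces as a non-success return code instead of aborting the job.          */
static int allreduce_with_recovery(double *buf, int n, MPI_Comm *comm)
{
    int rc = MPI_Allreduce(MPI_IN_PLACE, buf, n, MPI_DOUBLE, MPI_SUM, *comm);
    if (rc == MPI_SUCCESS)
        return 0;

    /* A failure was reported: make sure every surviving rank sees it.          */
    MPIX_Comm_revoke(*comm);

    /* Build a smaller communicator containing only the survivors.              */
    MPI_Comm survivors;
    MPIX_Comm_shrink(*comm, &survivors);
    MPI_Comm_free(comm);
    *comm = survivors;

    /* Rebuild the lost blocks from the in-memory checksums (placeholder for
     * the decode step described above), then let the caller retry the step.    */
    /* pcheckpoint_recover(*comm); */
    return 1;
}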

If you're interested in the work we're doing and would like to know more, do not hesitate to get in touch with Allan S. Nielsen at EPFL, or Chris Cantwell at Imperial College.


References

[Dal06] John T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems, 22(3):303-312, 2006.

[MBMS10] Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R. de Supinski. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-11. IEEE Computer Society, 2010.

[PG14] James S. Plank and Kevin M. Greenan. Jerasure: A library in C facilitating erasure coding for storage applications, version 2.0. Technical Report UT-EECS-14-721, University of Tennessee, 2014.