Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing (1994)
Austin, TX, USA
June 15, 1994 to June 17, 1994
J.S. Plank, Dept. of Comput. Sci., Tennessee Univ., Knoxville, TN, USA
This paper presents a way to perform fast, incremental checkpointing of multicomputers and distributed systems by using N+1 parity. A basic algorithm is described that uses two extra processors for checkpointing and enables the system to tolerate any single processor failure. The algorithm's speed comes from a combination of N+1 parity, extra physical memory, and virtual memory hardware, so that checkpoints need not be written to disk. This eliminates the most time-consuming portion of checkpointing. The algorithm requires each application processor to allocate a fixed amount of extra memory for checkpointing. This amount may be set statically by the programmer, and need not be equal to the size of the processor's writable address space. This alleviates a major restriction of previous checkpointing algorithms using N+1 parity. Finally, we outline how to extend our algorithm to tolerate any m processor failures with the addition of 2m extra checkpointing processors.
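The N+1 parity idea that the abstract relies on can be illustrated with a minimal sketch. This is not the paper's algorithm, which also involves incremental checkpointing, extra physical memory, and virtual memory hardware; it only shows the underlying property that the bitwise XOR of N equal-sized checkpoints lets any single lost checkpoint be rebuilt from the other N-1 plus the parity. The function names (`xor_blocks`, `make_parity`, `recover`) and the sample data are hypothetical, chosen for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) walks the blocks column by column; each column is XOR-folded.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_parity(checkpoints):
    """Parity held by the extra processor (assumed layout): XOR of all checkpoints."""
    return xor_blocks(checkpoints)


def recover(surviving, parity):
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return xor_blocks(list(surviving) + [parity])


# Three application processors, each with a 4-byte checkpoint (made-up data).
checkpoints = [
    b"\x01\x02\x03\x04",
    b"\x10\x20\x30\x40",
    b"\xaa\xbb\xcc\xdd",
]
parity = make_parity(checkpoints)

# Simulate the failure of processor 1 and rebuild its checkpoint.
lost = 1
survivors = checkpoints[:lost] + checkpoints[lost + 1:]
restored = recover(survivors, parity)
```

Because XOR is its own inverse, `restored` equals the checkpoint of the failed processor. The same identity is why a single parity block can protect against only one failure at a time; tolerating more simultaneous failures needs additional redundancy, as the abstract's closing sentence indicates.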
fault tolerant computing, reliability, distributed processing, virtual storage
J. Plank and Kai Li, "Faster checkpointing with N+1 parity," Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing (FTCS), Austin, TX, USA, 1994, pp. 288-297.