Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing (1994)
Austin, TX, USA
June 15, 1994 to June 17, 1994
A. Roy-Chowdhury, Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
P. Banerjee, Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
Previous algorithm-based methods for developing reliable versions of numerical algorithms have mostly concerned themselves with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and recover from them once they are located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data corrupted by the faulty processors. In this paper, we discuss in detail a fault-tolerant version of a matrix multiplication algorithm. The ideas developed in the derivation of the fault-tolerant matrix multiplication algorithm may be used to derive fault-tolerant versions of other numerical algorithms. We outline how two other numerical algorithms, QR factorization and Gaussian elimination, may be made fault-tolerant using our approach. Our fault model assumes that a faulty processor can corrupt all the data it possesses. We present error coverage and overhead results for the single faulty processor case for fault-locating and fault-tolerant versions of three numerical algorithms on an Intel iPSC/2 hypercube multicomputer.
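The abstract does not spell out the encoding the authors use, so the following is only an illustrative sketch of the general idea of moving from detection to location and recovery. It assumes the classic row/column checksum encoding for matrix multiplication and a much simpler fault model than the paper's: a single corrupted element of the product rather than all data held by a faulty processor. The function names (matmul, add_column_checksum, add_row_checksum, locate_and_correct) are hypothetical and not taken from the paper.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def locate_and_correct(c_full, tol=1e-9):
    """Locate and repair a single corrupted data element in place.

    c_full is the product of a column-checksum matrix and a row-checksum
    matrix, so its last row and last column hold checksums of the data.
    Assumption (not from the paper): at most one data element is wrong.
    Returns the (row, col) that was repaired, or None if all checks pass.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Recover the element from the row checksum and the other entries.
        c_full[i][j] = c_full[i][m] - sum(c_full[i][k] for k in range(m) if k != j)
        return (i, j)
    raise ValueError("error pattern is not a single corrupted data element")


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c_full = matmul(add_column_checksum(a), add_row_checksum(b))
    c_full[0][1] = 99                    # inject a single fault
    print(locate_and_correct(c_full))    # position of the repaired element
    print(c_full[0][1])                  # restored value
```

The intersection of the one failing row check and the one failing column check locates the error, and the row checksum then supplies the correct value. Handling a processor that corrupts every element it owns, as in the fault model stated above, requires more than this single-element sketch provides.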
matrix algebra, fault tolerant computing, parallel algorithms, system recovery
A. Roy-Chowdhury and P. Banerjee, "Algorithm-based fault location and recovery for matrix computations," Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing (FTCS), Austin, TX, USA, 1994, pp. 38-47.