Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing (1994)
Austin, TX, USA
June 15, 1994 to June 17, 1994
ISBN: 0-8186-5520-8
pp: 38-47
A. Roy-Chowdhury , Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
P. Banerjee , Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
ABSTRACT
Previous algorithm-based methods for developing reliable versions of numerical algorithms have mostly concerned themselves with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and recover from them once they are located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data corrupted by the faulty processors. In our paper, we discuss in detail a fault-tolerant version of a matrix multiplication algorithm. The ideas developed in the derivation of the fault-tolerant matrix multiplication algorithms may be used to derive fault-tolerant versions of other numerical algorithms. We outline how two other numerical algorithms, QR factorization and Gaussian elimination, may be made fault-tolerant using our approach. Our fault model assumes that a faulty processor can corrupt all the data it possesses. We present error coverage and overhead results for the single faulty processor case for fault-locating and fault-tolerant versions of three numerical algorithms on an Intel iPSC/2 hypercube multicomputer.
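The abstract does not spell out the encoding used, so the following is only an illustrative sketch of the general idea of locating and recovering from an error in a matrix product. It uses the classic row/column checksum scheme for a single corrupted element, which is simpler than the paper's fault model (where a faulty processor may corrupt all the data it holds). All function names here are hypothetical and chosen for illustration.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row[:] + [sum(row)] for row in b]


def locate_and_correct(c_full, tol=1e-9):
    """Check a full-checksum product; fix a single bad data element in place.

    Returns None if all checksums agree, or the (row, col) index of the
    element that was located and corrected. Raises ValueError if the
    mismatch pattern is not that of a single corrupted data element.
    """
    n, p = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:p]) - c_full[i][p]) > tol]
    bad_cols = [j for j in range(p)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Recover the element from the row checksum and the other entries.
        c_full[i][j] = c_full[i][p] - sum(c_full[i][k] for k in range(p) if k != j)
        return (i, j)
    raise ValueError("error pattern is not a single corrupted element")


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
c_full[1][0] = 0                      # simulate a corrupted result element
located = locate_and_correct(c_full)  # intersecting bad row and bad column
```

The failing row checksum and failing column checksum intersect at the corrupted element, which gives location; the row checksum then supplies the value needed for recovery. Handling a processor that corrupts a whole block of data, as the paper's fault model assumes, requires a stronger encoding than this sketch provides.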
INDEX TERMS
matrix algebra, fault tolerant computing, parallel algorithms, system recovery
CITATION

A. Roy-Chowdhury and P. Banerjee, "Algorithm-based fault location and recovery for matrix computations," Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing (FTCS), Austin, TX, USA, 1994, pp. 38-47.
doi:10.1109/FTCS.1994.315659