Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing (1994)
Austin, TX, USA
June 15, 1994 to June 17, 1994
D. Cummings, Jet Propulsion Lab., California Inst. of Technol., Pasadena, CA, USA
L. Alkalaj, Jet Propulsion Lab., California Inst. of Technol., Pasadena, CA, USA
The Common Spaceborne Multicomputer Operating System (COSMOS) is a spacecraft operating system for distributed memory multiprocessors, designed to meet the on-board computing requirements of long-life interplanetary missions. One of the main features of COSMOS is software-implemented fault tolerance, including 2-way voting, 3-way voting, and checkpoint/rollback. This paper describes the COSMOS distributed checkpoint/rollback approach, which exploits the fact that a COSMOS application program is based on a coarse-grained dataflow programming paradigm, so most of the state of a distributed application program is contained in the data tokens. Furthermore, all computers maintain a consistent view of this dynamic state, which facilitates the implementation of a coordinated checkpoint.
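The idea that application state lives mostly in the data tokens can be illustrated with a minimal, single-process sketch. This is not the COSMOS implementation: the names `TokenStore` and `fire` are hypothetical, and the sketch assumes the whole state is one dictionary of pending tokens, so it does not model distribution across computers, voting, or the coordination protocol.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class TokenStore:
    """Hypothetical application state: pending data tokens keyed by name."""
    tokens: dict = field(default_factory=dict)
    saved: dict = field(default_factory=dict)

    def checkpoint(self):
        # Because the state is assumed to live in the tokens,
        # a checkpoint is just a copy of the current token set.
        self.saved = copy.deepcopy(self.tokens)

    def rollback(self):
        # Restore the token set recorded at the last checkpoint.
        self.tokens = copy.deepcopy(self.saved)


def fire(store, inputs, output, fn):
    """Fire one coarse-grained task: consume input tokens, emit one output token."""
    args = [store.tokens.pop(name) for name in inputs]
    store.tokens[output] = fn(*args)


store = TokenStore({"a": 2, "b": 3})
store.checkpoint()
fire(store, ["a", "b"], "sum", lambda x, y: x + y)
after_fire = dict(store.tokens)      # tokens "a" and "b" consumed, "sum" produced
store.rollback()                     # simulate a fault detected after the firing
after_rollback = dict(store.tokens)  # original input tokens are back
```

After the rollback the task can simply be fired again from the restored tokens, which is the property that makes a token-based checkpoint attractive in this simplified setting.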
distributed memory systems, space vehicles, parallel processing, fault tolerant computing, software reliability, operating systems (computers), aerospace computing, concurrency control
D. Cummings and L. Alkalaj, "Checkpoint/rollback in a distributed system using coarse-grained dataflow," Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing (FTCS), Austin, TX, USA, 1994, pp. 424-433.