The reliability of concurrent and distributed systems often depends on well-known fault-tolerance techniques. One such technique is based on checkpointing and rollback recovery. With checkpointing, processes regularly take snapshots of their current state, so that a rollback recovery strategy can bring the system back to a previous consistent state whenever a failure occurs. In this paper, we consider a message-passing concurrent programming language and propose a novel rollback recovery strategy based on explicit checkpointing operators and on a (partially) reversible semantics for rolling back the system.