Rollback recovery strategies are well known in concurrent and distributed systems. In this setting, recovering from unexpected failures is especially relevant because execution is non-deterministic, which makes it practically impossible to foresee all possible process interactions. In this work, we consider a message-passing concurrent programming language in which processes interact by sending and receiving messages and shared memory is not allowed. For this language, we design a checkpoint-based rollback recovery strategy that does not need central coordination. For this purpose, we extend the language with three new operators: check, commit, and rollback. Furthermore, our approach is purely asynchronous, which is an essential ingredient for developing a source-to-source program instrumentation that implements a rollback recovery strategy.
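To give an intuition for the three operators, the following is a minimal single-process sketch in Python. It is an illustration under our own assumptions, not the semantics defined in this work: the class `Process`, its method signatures, and the use of integer checkpoint identifiers are hypothetical, and the sketch deliberately omits the distributed part of the problem, namely how a rollback is propagated asynchronously to processes that exchanged messages after the checkpoint.

```python
import copy


class Process:
    """Toy process with private state only (no shared memory)."""

    def __init__(self, name, state=None):
        self.name = name
        self.state = dict(state or {})
        self._checkpoints = {}  # checkpoint id -> saved copy of the state
        self._next_id = 0

    def check(self):
        """Save a snapshot of the current state and return its identifier."""
        cid = self._next_id
        self._next_id += 1
        self._checkpoints[cid] = copy.deepcopy(self.state)
        return cid

    def commit(self, cid):
        """Discard the checkpoint; work done since it was taken is kept."""
        del self._checkpoints[cid]

    def rollback(self, cid):
        """Restore the state saved at the checkpoint, then discard it."""
        self.state = self._checkpoints.pop(cid)


def demo():
    p = Process("p1", {"balance": 100})

    c1 = p.check()
    p.state["balance"] -= 30
    p.commit(c1)  # the update succeeded, so keep it

    c2 = p.check()
    p.state["balance"] -= 500
    if p.state["balance"] < 0:
        p.rollback(c2)  # the update failed, so undo it

    return p.state["balance"]
```

In this sketch, `check` opens a region whose effects can still be undone, `commit` closes it and makes the effects permanent, and `rollback` closes it by restoring the saved state. In a message-passing setting, undoing local state is not enough, since messages sent after the checkpoint may have influenced other processes; handling that without central coordination is the problem this work addresses.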