In large distributed systems, failures occur daily, and they become more frequent as the number of computation tasks and of the locations on which they are deployed grows. Representing an application as a workflow makes it possible to exploit Workflow Management System (WMS) features such as portability. Another relevant feature that some WMSs provide is reliability. In recent years, the emergence of hybrid workflows has posed new and intriguing challenges by widening the possibilities for distributing computations across heterogeneous and independent environments. Consequently, the number of possible points of failure in an execution has increased, which raises important challenges worth studying. This paper presents the implementation of a fault tolerance mechanism for hybrid workflows based on the recovery and rollback approach. A representation of hybrid workflows using a formal framework is provided, together with experiments demonstrating the functionality of the implemented approach.