Failures in Task-based Parallel Programming (TBPP) can severely degrade performance and produce incomplete or incorrect results. Existing failure-handling approaches, whether reactive, proactive, or resilient (for example, retry and checkpointing), often apply the same retry mechanism regardless of a failure's root cause, and they do not account for characteristics specific to TBPP frameworks, such as heterogeneous resource availability and task-level failures. To address these limitations, we propose WRATH, a novel systematic approach that categorizes failures according to the layered structure of TBPP frameworks and defines specific responses for failures at each layer. WRATH combines a distributed monitoring system with a resilient module; together they address different types of failures in real time. The monitoring system captures execution and resource information, reports failures, and profiles tasks across the layers of a TBPP framework. The resilient module then categorizes each failure and responds with an appropriate action, such as hierarchically retrying a failed task on suitable resources. Evaluations demonstrate that WRATH significantly improves TBPP robustness, tripling the task success rate and maintaining an application success rate above 90% for resolvable failures. Additionally, WRATH can reduce the time to failure by 20%-50%, so that tasks that are destined to fail are identified and fail sooner.
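The idea of categorizing a failure by layer before choosing a response can be illustrated with a minimal Python sketch. This is not WRATH's implementation: the two layer names, the exception-to-layer mapping, and the names `Resource`, `Outcome`, `categorize`, and `run_with_retry` are all hypothetical, chosen only to show how a resource-level failure might be retried on a larger resource while a task-level failure fails fast.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List, Sequence, Tuple


class Layer(Enum):
    """Hypothetical failure layers; the real framework may define others."""
    TASK = auto()      # caused by the task itself (e.g. malformed input)
    RESOURCE = auto()  # caused by where the task ran (e.g. too little memory)


@dataclass
class Resource:
    name: str
    memory_gb: int


@dataclass
class Outcome:
    ok: bool
    value: object = None
    attempts: List[Tuple[str, Layer]] = field(default_factory=list)


def categorize(exc: Exception) -> Layer:
    # Assumed mapping: memory exhaustion is a resource-layer failure,
    # everything else is treated as a task-layer failure.
    return Layer.RESOURCE if isinstance(exc, MemoryError) else Layer.TASK


def run_with_retry(task: Callable[[Resource], object],
                   resources: Sequence[Resource]) -> Outcome:
    """Try the task on resources from smallest to largest.

    Resource-layer failures escalate to the next larger resource;
    task-layer failures stop immediately, since a retry cannot help.
    """
    attempts: List[Tuple[str, Layer]] = []
    for res in sorted(resources, key=lambda r: r.memory_gb):
        try:
            return Outcome(ok=True, value=task(res), attempts=attempts)
        except Exception as exc:
            layer = categorize(exc)
            attempts.append((res.name, layer))
            if layer is Layer.TASK:
                break  # fail fast: no resource change will fix this
    return Outcome(ok=False, attempts=attempts)


# Two illustrative tasks: one resolvable by a larger resource, one not.
def needs_8gb(res: Resource) -> str:
    if res.memory_gb < 8:
        raise MemoryError("not enough memory")
    return f"done on {res.name}"


def bad_input(res: Resource) -> str:
    raise ValueError("malformed input")


pool = [Resource("large", 16), Resource("small", 4)]
resolved = run_with_retry(needs_8gb, pool)
doomed = run_with_retry(bad_input, pool)
```

In this sketch, `needs_8gb` fails once on the small resource and then succeeds on the large one, while `bad_input` is abandoned after a single attempt. The second case mirrors the abstract's point about reducing time to failure: a uniform retry policy would have spent further attempts on a task that cannot succeed.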