Supercomputers are becoming ever larger and more energy-efficient, which is at odds with the reliability of the hardware used: the time intervals between component failures are shrinking. Conversely, the latencies of individual operations of coarse-grained big-data tools grow with the number of processors. To overcome the resulting scalability limit, we need to go beyond the current practice of inter-operation checkpointing. We give first results on how to achieve this for the popular MapReduce framework, where huge multisets are processed by user-defined mapping and reducing functions. We observe that the full state of a MapReduce algorithm is described by its network communication. We present a low-overhead technique that requires no additional work during fault-free execution and incurs a negligible expected relative communication overhead of $1/(p-1)$ on $p$ processing elements (PEs). Recovery takes approximately the time of processing $1/p$ of the data on the surviving PEs. We achieve this by backing up self-messages and by storing all messages sent through the network locally on the sending and receiving PEs until the next round of global communication. A prototypical implementation already indicates a low overhead of $<4\,\%$ during fault-free execution.
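The recovery scheme sketched in the abstract can be illustrated in a few lines. The following is a hypothetical Python sketch, not the authors' implementation: every PE logs the messages it sent and received in the last all-to-all exchange, self-messages are additionally backed up on a partner PE (here, assumed to be the next PE in rank order), and on a failure the survivors replay their logged messages addressed to the failed PE to reconstruct its round input. The class and function names (`PE`, `exchange`, `recover`) are illustrative.

```python
class PE:
    """One processing element with per-round message logs."""
    def __init__(self, rank, p):
        self.rank, self.p = rank, p
        self.inbox = {}           # src rank -> payload received this round
        self.sent_log = {}        # dst rank -> payload sent this round
        self.partner_backup = {}  # backed-up self-messages of partner PEs

def exchange(pes, make_msg):
    """One round of all-to-all communication; every message is kept
    on both the sending and the receiving PE until the next round."""
    p = len(pes)
    for src in pes:
        src.sent_log.clear()
    for src in pes:
        for dst_rank in range(p):
            payload = make_msg(src.rank, dst_rank)
            src.sent_log[dst_rank] = payload
            if dst_rank == src.rank:
                # Self-message: also back it up on the next PE, since a
                # failed PE cannot replay its own messages.
                pes[(src.rank + 1) % p].partner_backup[src.rank] = payload
            pes[dst_rank].inbox[src.rank] = payload

def recover(pes, failed_rank):
    """Reconstruct the failed PE's round input from the survivors' logs."""
    restored = {}
    for pe in pes:
        if pe.rank == failed_rank:
            continue
        if failed_rank in pe.sent_log:
            restored[pe.rank] = pe.sent_log[failed_rank]
        if failed_rank in pe.partner_backup:
            restored[failed_rank] = pe.partner_backup[failed_rank]
    return restored
```

In this sketch, replaying the logs restores exactly the inbox the failed PE held before the crash; the extra traffic is the one backed-up self-message per PE per round, matching the intuition behind the $1/(p-1)$ relative overhead.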