The widespread development and use of edge devices has highlighted the importance of resilient, computationally capable environments. With edge devices, these goals are usually achieved through replication and offloading. This paper reports on the design and implementation of Workrs, a fault-tolerant service that enables jobs to be offloaded from devices with limited computational power. We propose a solution in which users upload jobs through a web service and the jobs are executed on edge nodes within the system. The solution is designed to be fault tolerant and scalable: it has no single point of failure and can accommodate growth if the service is expanded. Docker checkpointing on the worker machines ensures that jobs can be resumed after a fault. We provide a mathematical approach to optimizing the number of checkpoints created during a computation, assuming that the time needed to execute a job can be forecast. We present experiments that indicate in which scenarios checkpointing benefits job execution. The results are based on a working prototype and show clear benefits of checkpoint and restore as job completion time grows relative to the forecast fault rate. The code of Workrs is released as open source and is available at \url{https://github.com/orgs/P7-workrs/repositories}. This paper is an extended version of \cite{edge2023paper}.
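To make the checkpoint-count trade-off concrete, the following is a minimal sketch based on Young's classical first-order approximation of the checkpoint interval, $\tau \approx \sqrt{2 C M}$, where $C$ is the cost of writing one checkpoint and $M$ is the mean time between faults. This is an illustrative stand-in and not necessarily the optimization derived in this paper; the function names and parameters are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate checkpoint interval (seconds) using Young's formula.

    Assumption: faults arrive at a roughly constant rate and the checkpoint
    cost is small compared with the mean time between faults (mtbf_s).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_count(job_time_s: float, checkpoint_cost_s: float, mtbf_s: float) -> int:
    """Number of checkpoints to take during a job of forecast length job_time_s.

    The job is split into segments of at most one interval each; no
    checkpoint is taken after the final segment, because the job is done.
    """
    if job_time_s <= 0:
        raise ValueError("job time must be positive")
    interval = young_interval(checkpoint_cost_s, mtbf_s)
    segments = math.ceil(job_time_s / interval)
    return max(0, segments - 1)
```

Under these assumptions, a job that is shorter than one interval gets no checkpoints at all, which matches the intuition that checkpointing only pays off once the job's running time becomes large relative to the expected time between faults.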