Edge computing has recently emerged as a promising paradigm for boosting the performance of distributed learning by leveraging the distributed resources at edge nodes. Architecturally, the introduction of edge nodes adds an intermediate layer between the master and the workers of conventional distributed learning systems, potentially aggravating the straggler effect. Coding-theoretic approaches have recently been proposed for straggler mitigation in distributed learning, but the majority target the conventional worker-master architecture. In this paper, along a different line, we investigate the problem of mitigating the straggler effect in hierarchical distributed learning systems that include an additional layer of edge nodes. Technically, we first derive the fundamental trade-off between the computational load of each worker and the straggler tolerance. We then propose a hierarchical gradient coding framework, which provides stronger straggler mitigation, to achieve the derived trade-off. To further improve the performance of our framework in heterogeneous scenarios, we formulate an optimization problem that minimizes the expected execution time of each iteration of the learning process, and we develop an efficient algorithm that solves it by outputting the optimal strategy. Extensive simulation results demonstrate the superiority of our schemes over conventional solutions.
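To make the load-versus-tolerance trade-off concrete, the following is a minimal sketch of classical (non-hierarchical) gradient coding in the style of Tandon et al., not the hierarchical scheme proposed in this paper: with n = 3 workers each computing s + 1 = 2 of the 3 gradient partitions, the master can recover the full gradient from any 2 workers, tolerating s = 1 straggler. All partition values and decoding coefficients here are illustrative.

```python
import numpy as np

# Three gradient partitions; the master needs their sum (the full gradient).
rng = np.random.default_rng(0)
g1, g2, g3 = rng.standard_normal((3, 4))
full_gradient = g1 + g2 + g3

# Each worker computes 2 of the 3 partitions and sends one coded combination.
w1 = 0.5 * g1 + g2   # worker 1 holds partitions 1 and 2
w2 = g2 - g3         # worker 2 holds partitions 2 and 3
w3 = 0.5 * g1 + g3   # worker 3 holds partitions 1 and 3

# Any 2 of the 3 messages suffice: fixed decoding coefficients per subset.
decoders = {
    (1, 2): (2.0, -1.0),   # 2*w1 - 1*w2 = g1 + g2 + g3
    (1, 3): (1.0, 1.0),    # 1*w1 + 1*w3 = g1 + g2 + g3
    (2, 3): (1.0, 2.0),    # 1*w2 + 2*w3 = g1 + g2 + g3
}
sent = {1: w1, 2: w2, 3: w3}
for (i, j), (a, b) in decoders.items():
    recovered = a * sent[i] + b * sent[j]
    assert np.allclose(recovered, full_gradient)
print("full gradient recovered from every 2-worker subset")
```

Doubling each worker's computational load (2 partitions instead of 1) buys tolerance to one straggler; the paper's hierarchical framework extends this kind of trade-off to the edge-node layer.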