Machine learning (ML) tasks are among the major workloads in today's edge computing networks. Existing edge-cloud schedulers allocate each task the amount of resources it requests, which falls short of making the best use of the limited edge resources for ML tasks. This paper proposes TapFinger, a distributed scheduler for edge clusters that minimizes the total completion time of ML tasks by co-optimizing task placement and fine-grained multi-resource allocation. To learn the tasks' uncertain resource sensitivity and enable distributed scheduling, we adopt multi-agent reinforcement learning (MARL) and propose several techniques to make it efficient, including a heterogeneous graph attention network as the MARL backbone, a tailored task selection phase in the actor network, and the integration of Bayes' theorem and masking schemes. We first implement a single-task scheduling version, which schedules at most one task at a time. We then generalize it to the multi-task scheduling case, in which a sequence of tasks is scheduled simultaneously. Our design mitigates the expanded decision space and yields fast convergence to optimal scheduling solutions. Extensive experiments using synthetic and test-bed ML task traces show that TapFinger achieves up to a 54.9% reduction in average task completion time and improves resource efficiency compared to state-of-the-art schedulers.