Distributed computing, in which a resource-intensive task is divided into subtasks and distributed among different machines, plays a key role in solving large-scale problems. Coded computing is a recently emerging paradigm that introduces redundancy into distributed computing to alleviate the impact of slow machines (stragglers) on the completion time. We investigate coded computing solutions over elastic resources, where the set of available machines may change in the middle of the computation. This is motivated by recently available services in the cloud computing industry (e.g., EC2 Spot, Azure Batch), where low-priority virtual machines are offered at a fraction of the price of on-demand instances but can be preempted on short notice. Our contributions are threefold. We first introduce a new concept called transition waste, which quantifies the number of tasks that existing machines must abandon or take over when a machine joins or leaves. We then develop an efficient method to minimize the transition waste for the cyclic task allocation scheme recently proposed in the literature (Yang et al., ISIT'19). Finally, we establish a novel solution based on finite geometry that achieves zero transition waste, provided that the number of active machines varies within a fixed range.