Design and Optimization of Heterogeneous Coded Distributed Computing with Nonuniform File Popularity

This paper studies MapReduce-based heterogeneous coded distributed computing (CDC) in which, in addition to workers having different computing capabilities, the input files accessed by computing jobs have nonuniform popularity. We propose a file placement strategy that can handle an arbitrary number of input files. Furthermore, we design a nested coded shuffling strategy that efficiently handles the nonuniformity of file popularity to maximize the coded multicasting opportunity. We then formulate the joint optimization of the file placement and nested shuffling design variables for the proposed CDC scheme. To reduce the high computational complexity of solving the resulting mixed-integer linear programming (MILP) problem, we propose a simple two-file-group-based file placement approach that yields an approximate solution. Numerical results show that the optimized CDC scheme outperforms other alternatives. Moreover, the proposed two-file-group-based approach achieves nearly the same performance as the conventional branch-and-cut method for solving the MILP problem, but with substantially lower computational complexity that scales with the number of files and workers. For computing jobs with aggregate target functions, which commonly appear in machine learning applications, we propose a heterogeneous compressed CDC (C-CDC) scheme to further improve the shuffling efficiency. The C-CDC scheme uses a local data aggregation technique to compress the data to be shuffled, thereby reducing the shuffling load. We again optimize the proposed C-CDC scheme and explore the two-file-group-based low-complexity approach for an approximate solution. Numerical results show that the proposed C-CDC scheme provides a considerable shuffling load reduction over the CDC scheme and that the two-file-group-based file placement approach maintains good performance.
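The local data aggregation idea behind C-CDC can be illustrated with a toy sketch. It is not the paper's scheme: it assumes the aggregate target function is a plain sum, and the names `shuffle_uncompressed`, `shuffle_aggregated`, and `local_values` are hypothetical, chosen only for this example. It shows that when the receiver only needs the aggregate, a worker can combine its locally computed intermediate values before shuffling instead of sending each one.

```python
from typing import Dict, List


def shuffle_uncompressed(local_values: Dict[str, List[float]], target: str) -> List[float]:
    """Send every intermediate value destined for `target` unchanged."""
    return list(local_values[target])


def shuffle_aggregated(local_values: Dict[str, List[float]], target: str) -> List[float]:
    """Assumption: the target function is a sum, so the sender may
    pre-aggregate its local intermediate values into a single value."""
    return [sum(local_values[target])]


# Hypothetical state at one worker: intermediate values it computed in the
# map phase that another worker ("worker2") needs for its reduce function.
local_values = {"worker2": [0.5, 1.5, 2.0]}

plain = shuffle_uncompressed(local_values, "worker2")
packed = shuffle_aggregated(local_values, "worker2")

# The receiver recovers the same contribution to the sum either way,
# but the aggregated message carries one value instead of three.
print(len(plain), len(packed))
```

In this toy case the shuffled payload shrinks from three values to one while the receiver's final sum is unchanged. The scheme described in the abstract additionally optimizes file placement and coded shuffling, which this sketch does not attempt to model.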
