Design and Optimization of Heterogeneous Coded Distributed Computing with Nonuniform File Popularity

This paper studies MapReduce-based heterogeneous coded distributed computing (CDC) in which workers have different computing capabilities and the input files accessed by computing jobs have nonuniform popularity. We propose a file placement strategy that can handle an arbitrary number of input files. We further design a nested coded shuffling strategy that efficiently manages the nonuniformity of file popularity to maximize coded multicasting opportunities. We then formulate the joint optimization of the file placement and nested shuffling design variables as a mixed-integer linear programming (MILP) problem. To reduce the high computational complexity of solving this problem, we propose a simple two-file-group-based file placement approach that yields an approximate solution. Numerical results show that the optimized CDC scheme outperforms other alternatives. Moreover, the two-file-group-based approach achieves nearly the same performance as the conventional branch-and-cut method for solving the MILP problem, but with substantially lower computational complexity that remains scalable in the number of files and workers. For computing jobs with aggregate target functions, which commonly appear in machine learning applications, we propose a heterogeneous compressed CDC (C-CDC) scheme to further improve the shuffling efficiency. The C-CDC scheme uses local data aggregation to compress the data to be shuffled, thereby reducing the shuffling load. We likewise optimize the C-CDC scheme and explore the two-file-group-based low-complexity approach for an approximate solution. Numerical results show that the C-CDC scheme provides a considerable shuffling load reduction over the CDC scheme, and that the two-file-group-based file placement approach maintains good performance.
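
The local data aggregation idea mentioned above can be illustrated with a minimal sketch. This is not the C-CDC scheme itself: it ignores file placement, worker heterogeneity, and coded multicasting, and only shows why an aggregate target function (here assumed to be a sum) lets a worker combine its intermediate values before shuffling. The names `map_file` and `shuffle_payload`, and the toy map output, are hypothetical.

```python
from collections import defaultdict

def map_file(file_id, num_functions):
    """Hypothetical map step: one intermediate value per (function, file) pair."""
    return {q: (file_id + 1) * (q + 1) for q in range(num_functions)}

def shuffle_payload(local_files, num_functions, aggregate):
    """Return the (function, value) pairs a worker would place on the network.

    Without aggregation, every intermediate value is sent separately.
    With aggregation, values for the same target function are summed
    locally first, which is valid only because the target is a sum.
    """
    if not aggregate:
        return [(q, v)
                for f in local_files
                for q, v in map_file(f, num_functions).items()]
    partial = defaultdict(int)
    for f in local_files:
        for q, v in map_file(f, num_functions).items():
            partial[q] += v  # local aggregation before shuffling
    return sorted(partial.items())

local_files = [0, 1, 2, 3]   # files stored at this worker (assumed)
num_functions = 3            # number of target functions (assumed)

raw = shuffle_payload(local_files, num_functions, aggregate=False)
agg = shuffle_payload(local_files, num_functions, aggregate=True)
print(len(raw), "values without aggregation;", len(agg), "with aggregation")
```

In this toy setting the worker sends one value per target function instead of one per (function, file) pair, while the receiving side still recovers the same sums.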
