Resilience against stragglers is a critical requirement for prediction serving systems, which run inference on input data using a pre-trained machine-learning model. In this paper, we propose NeRCC, a general straggler-resistant framework for approximate coded computing. NeRCC consists of three layers: (1) encoding regression and sampling, which generates coded data points as combinations of the original data points; (2) computing, in which a cluster of workers runs inference on the coded data points; and (3) decoding regression and sampling, which approximately recovers the predictions for the original data points from the predictions available on the coded data points. We argue that the overall objective of the framework reveals an underlying interconnection between the two regression models in the encoding and decoding layers. We propose a solution to this nested regression problem by summarizing the dependence between the two models in two regularization terms that are jointly optimized. Our extensive experiments on different datasets and various machine-learning models, including LeNet5, RepVGG, and Vision Transformer (ViT), demonstrate that NeRCC accurately approximates the original predictions across a wide range of straggler counts, outperforming the state of the art by up to 23%.
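The three layers described above can be illustrated with a minimal sketch. It is not the NeRCC implementation: it assumes ridge-regularized polynomial regression in a Chebyshev basis for both the encoder and the decoder, hand-picked regularization weights (`lam_enc`, `lam_dec`) rather than the jointly optimized ones the abstract mentions, and a single shared polynomial degree. All function and parameter names are hypothetical.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander


def cheb_nodes(n):
    """Return n Chebyshev points of the first kind in (-1, 1)."""
    return np.cos((2 * np.arange(n) + 1) * np.pi / (2 * n))


def ridge_fit(z, y, degree, lam):
    """Ridge regression of the rows of y on a Chebyshev basis evaluated at z."""
    V = chebvander(z, degree)                       # shape (len(z), degree + 1)
    A = V.T @ V + lam * np.eye(degree + 1)
    return np.linalg.solve(A, V.T @ y)              # shape (degree + 1, dim)


def ridge_eval(coef, z):
    """Evaluate the fitted regression function at the points z."""
    return chebvander(z, coef.shape[0] - 1) @ coef


def coded_inference(model, X, n_workers, stragglers, degree, lam_enc, lam_dec):
    """Approximate model(x_k) for every row x_k of X despite stragglers.

    Assumption: encoder and decoder are both ridge-regularized polynomial
    regressions; the weights lam_enc and lam_dec are set by hand here.
    """
    K = X.shape[0]
    alpha = cheb_nodes(K)            # regression points tied to the original data
    beta = cheb_nodes(n_workers)     # sampling points, one per worker

    # Layer 1: encoding regression, then sampling to obtain coded data points.
    enc = ridge_fit(alpha, X, degree, lam_enc)
    coded = ridge_eval(enc, beta)

    # Layer 2: each non-straggling worker runs inference on its coded point.
    alive = np.setdiff1d(np.arange(n_workers), stragglers)
    preds = np.stack([model(coded[i]) for i in alive])

    # Layer 3: decoding regression on the available predictions, then sampling
    # at alpha to approximate the predictions for the original data points.
    dec = ridge_fit(beta[alive], preds, degree, lam_dec)
    return ridge_eval(dec, alpha)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(5, 3))
    X = rng.normal(size=(4, 5))                     # K = 4 inputs of dimension 5
    linear_model = lambda x: x @ W                  # toy stand-in for a trained model
    approx = coded_inference(linear_model, X, n_workers=12, stragglers=[1, 6, 10],
                             degree=3, lam_enc=1e-10, lam_dec=1e-10)
    print("max abs error:", np.max(np.abs(approx - X @ W)))
```

With a linear toy model and near-zero regularization, the decoder recovers the original predictions almost exactly, because the composed function is itself a low-degree polynomial. For nonlinear models such as neural networks the recovery is only approximate, and the choice of the two regularization weights becomes the central design question.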