Machine learning inference pipelines in data science and industry are often user-facing and therefore must respond in real time. Meeting this requirement is particularly challenging when some input features must be computed by aggregating a large volume of data online. Recent literature on interpretable machine learning shows that most machine learning models are notably resilient to variations in their input, which suggests that they can tolerate approximate input features with little discernible impact on accuracy. In this paper, we introduce Biathlon, a novel ML serving system that exploits this inherent resilience and determines the optimal degree of approximation for each aggregation feature, maximizing speedup while guaranteeing a bound on accuracy loss. We evaluate Biathlon on real pipelines from both industry applications and data science competitions, and show that it meets real-time latency requirements, achieving 5.3x to 16.6x speedup with almost no accuracy loss.
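To make the idea of approximating an aggregation feature concrete, the following is a minimal sketch and not Biathlon's actual algorithm. It estimates a mean feature from a growing random sample and stops once a normal-approximation confidence interval on the feature changes the model output by at most a tolerance. The function `approx_mean_feature`, its parameters (`tol`, `start`, `growth`, `z`, `seed`), and the toy linear model are all hypothetical names introduced for illustration. Checking only the interval endpoints assumes the model is monotone in the feature, and the stopping rule is a heuristic rather than a formal guarantee.

```python
import math
import random
import statistics


def approx_mean_feature(values, model, tol, start=64, growth=2, z=1.96, seed=0):
    """Estimate mean(values) from a sample that is just large enough for `model`.

    Illustrative assumptions: `values` is non-empty, `model` maps one float
    feature to a float prediction and is monotone in that feature, and a
    normal-approximation interval is adequate for the sample mean.
    Returns (feature_estimate, number_of_rows_used).
    """
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # a prefix of a shuffled list is a uniform random sample

    n = min(start, len(order))
    while True:
        sample = order[:n]
        estimate = statistics.fmean(sample)
        if n == len(order):
            return estimate, n  # all rows were used, so the feature is exact

        # Half-width of the confidence interval on the sample mean.
        half_width = z * statistics.stdev(sample) / math.sqrt(n)

        # Spread of the model output across the feature's interval.
        spread = abs(model(estimate + half_width) - model(estimate - half_width))
        if spread <= tol:
            return estimate, n

        n = min(n * growth, len(order))  # not stable enough: enlarge the sample


# Hypothetical usage: 100,000 rows to aggregate and a toy linear model.
data_rng = random.Random(1)
rows = [data_rng.gauss(10.0, 2.0) for _ in range(100_000)]


def toy_model(x):
    return 0.5 * x + 1.0


feature, rows_used = approx_mean_feature(rows, toy_model, tol=0.2)
print(f"used {rows_used} of {len(rows)} rows, prediction {toy_model(feature):.3f}")
```

In this sketch the sample size is the single approximation knob. A pipeline with several aggregation features would need to decide how to split the sampling budget among them, which is the per-feature decision the abstract attributes to Biathlon.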