Many popular machine learning models scale poorly when deployed on CPUs. In this paper, we explore the reasons why and propose a simple yet effective approach, based on the well-known Divide-and-Conquer Principle, to tackle this problem of great practical importance. Given an inference job, the idea is to break it into independent parts that can be executed in parallel, each with a number of cores allotted according to its expected computational cost, instead of running the whole job on all available computing resources (i.e., CPU cores). We implement this idea in the popular OnnxRuntime framework and evaluate its effectiveness with several use cases, including well-known models for optical character recognition (PaddleOCR) and natural language processing (BERT).
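To make the core-allocation idea concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes that each part comes with a positive expected-cost estimate and that cores are split proportionally to those estimates, with at least one core per part; the abstract only says "according to its expected computational cost", so the proportional rule is an assumption. The names `allocate_cores`, `run_parts`, and `run_part` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def allocate_cores(costs, total_cores):
    """Split total_cores among parts in proportion to their expected costs.

    Assumption: proportional split, at least one core per part. Leftover
    cores go to the parts with the largest fractional remainders.
    """
    if not costs:
        return []
    if total_cores < len(costs):
        raise ValueError("need at least one core per part")
    if any(c <= 0 for c in costs):
        raise ValueError("expected costs must be positive")

    spare = total_cores - len(costs)      # cores left after one per part
    total_cost = sum(costs)
    shares = [spare * c / total_cost for c in costs]
    cores = [1 + int(s) for s in shares]  # guaranteed core + whole share

    # Hand out the remaining cores by largest fractional remainder.
    leftover = total_cores - sum(cores)
    by_remainder = sorted(
        range(len(costs)),
        key=lambda i: shares[i] - int(shares[i]),
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        cores[i] += 1
    return cores


def run_parts(parts, costs, total_cores, run_part):
    """Run independent parts concurrently, each with its own core budget.

    run_part(part, n_cores) is a caller-supplied (hypothetical) function
    that executes one part using n_cores threads.
    """
    cores = allocate_cores(costs, total_cores)
    with ThreadPoolExecutor(max_workers=max(1, len(parts))) as pool:
        futures = [pool.submit(run_part, p, n) for p, n in zip(parts, cores)]
        return [f.result() for f in futures]
```

For example, with expected costs `[1, 1, 2]` and 8 cores, the sketch assigns 2, 2, and 4 cores. In ONNX Runtime, one plausible way to honor such a budget is to give each part its own inference session with `SessionOptions.intra_op_num_threads` set to the assigned count; the abstract does not say whether the paper does it this way.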