Classical machine learning (CML) accounts for nearly half of the machine learning pipelines in production applications. Unfortunately, it fails to fully utilize state-of-the-practice devices and performs poorly. Without a unified framework, hybrid deployments of deep learning (DL) and CML also suffer from severe performance and portability issues. This paper presents the design of a unified compiler for CML inference, called CMLCompiler. We propose two unified abstractions: operator representations and extended computational graphs. The CMLCompiler framework performs conversion and graph optimization based on these two abstractions, then outputs an optimized computational graph to DL compilers or frameworks. We implement CMLCompiler on TVM. The evaluation demonstrates CMLCompiler's portability and superior performance. It achieves up to 4.38$\times$ speedup on CPU, 3.31$\times$ speedup on GPU, and 5.09$\times$ speedup on IoT devices compared to the state-of-the-art solutions: scikit-learn, intel sklearn, and hummingbird. On mixed CML and DL pipelines, it achieves up to 3.04$\times$ speedup compared with cross-framework implementations. The project documents and source code are available at https://www.computercouncil.org/cmlcompiler.
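To make the idea of converting a CML model into tensor operators concrete, the following is a minimal sketch of a decision tree evaluated purely with matrix multiplications and comparisons. It uses the GEMM-style formulation known from prior tensor-compilation work such as hummingbird; it is an illustrative assumption, not CMLCompiler's actual operator representation. The toy tree, the matrix names `A` through `E`, and the helper functions are all hypothetical.

```python
import numpy as np

# Hypothetical toy tree (not taken from the paper):
#   node 0: x[0] < 0.5 ? go to node 1 : leaf 2
#   node 1: x[1] < 0.3 ? leaf 0       : leaf 1
# Leaf classes: leaf 0 -> class 0, leaf 1 -> class 1, leaf 2 -> class 1.

# A[f, i] = 1 if internal node i tests feature f.
A = np.array([[1, 0],
              [0, 1]], dtype=np.float64)
# B[i] = threshold of internal node i.
B = np.array([0.5, 0.3])
# C[i, l] = 1 if leaf l lies in the left subtree of node i,
#          -1 if it lies in the right subtree, 0 if node i is not on its path.
C = np.array([[1,  1, -1],
              [1, -1,  0]], dtype=np.float64)
# D[l] = number of "go left" decisions on the path from the root to leaf l.
D = np.array([2, 1, 0], dtype=np.float64)
# E[l, c] = 1 if leaf l predicts class c.
E = np.array([[1, 0],
              [0, 1],
              [0, 1]], dtype=np.float64)


def predict_tensor(X):
    """Evaluate the tree for a batch using only tensor operators."""
    T = (X @ A < B).astype(np.float64)   # outcome of every node test
    T = (T @ C == D).astype(np.float64)  # one-hot vector of the reached leaf
    return np.argmax(T @ E, axis=1)      # class of the reached leaf


def predict_traversal(x):
    """Reference: conventional branch-based traversal of the same tree."""
    if x[0] < 0.5:
        return 0 if x[1] < 0.3 else 1
    return 1


X = np.array([[0.2, 0.1], [0.2, 0.9], [0.9, 0.1], [0.9, 0.9]])
tensor_pred = predict_tensor(X)
loop_pred = np.array([predict_traversal(x) for x in X])
print(tensor_pred, loop_pred)
```

Once the model is expressed this way, it contains no data-dependent branching, which is what lets a DL compiler or framework treat it like any other computational graph and run it on the same hardware back ends.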