Decision forests, including RandomForest, XGBoost, and LightGBM, are among the most popular machine learning techniques in industrial scenarios such as credit card fraud detection, ranking, and business intelligence. Because inference is usually performance-critical, a number of frameworks have been developed specifically for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forest from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. However, these frameworks are all decoupled from data management frameworks, and it is unclear whether in-database inference would improve overall performance. In addition, these frameworks use different algorithms, optimization techniques, and parallelism models. It is unclear how these implementation choices affect overall performance and how they should inform design decisions for an in-database inference framework. In this work, we investigate these questions by comprehensively comparing the end-to-end performance of the aforementioned inference frameworks and netsDB, an in-database inference framework we implemented. Through this study, we find that netsDB is best suited to small-scale models on large-scale datasets and to models of all scales on small-scale datasets, where it achieves speedups of up to hundreds of times. In addition, the relation-centric representation we propose significantly improves netsDB's performance on large-scale models, while the model reuse optimization we propose further improves netsDB's performance on small-scale datasets.