JoinBoost: Grow Trees Over Normalized Data Using Only SQL

Although tree models are dominant for tabular data, the ML libraries that train them (e.g., LightGBM, XGBoost) require a normalized database to be denormalized into a single table, materialized, and exported. This process is slow, does not scale, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modifying a DBMS to support In-DB ML, is it possible to offer tree training performance competitive with specialized ML libraries using only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting by updating the $Y$ variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preservation as the key property of the variance semi-ring that makes it possible to support RMSE, the most widely used criterion. On the systems side, we identify residual updates as a performance bottleneck. On columnar DBMSes, this overhead can be natively minimized by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiments show that JoinBoost is 3x (1.1x) faster than LightGBM for random forests (gradient boosting), and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the number of features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).
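To make the factorized idea concrete, the following is a minimal sketch and not JoinBoost's actual code or API. It assumes the usual representation of a variance semi-ring element as a triple (count, sum, sum of squares), uses Python's built-in sqlite3 as a stand-in for any SQL DBMS, and invents a hypothetical two-table schema (`sales`, `stores`). It illustrates two points from the abstract: lifting a target value turns addition into semi-ring multiplication, which is what lets a residual shift be expressed without materializing the join, and the aggregates needed to score a split can be computed by pre-aggregating the fact table before joining.

```python
import sqlite3

def lift(y):
    """Lift a target value into the assumed (count, sum, sum of squares) triple."""
    return (1, y, y * y)

def mul(a, b):
    """Semi-ring multiplication, used to combine annotations across a join."""
    c1, s1, q1 = a
    c2, s2, q2 = b
    return (c1 * c2, s1 * c2 + s2 * c1, q1 * c2 + q2 * c1 + 2 * s1 * s2)

def sse(c, s, q):
    """Sum of squared errors around the mean, derived from the triple."""
    return q - s * s / c

# Addition-to-multiplication preservation: lift(y + d) == lift(y) * lift(d).
# Here d plays the role of a (negated) prediction subtracted from y.
y, d = 5, -3
assert lift(y + d) == mul(lift(y), lift(d))

# Hypothetical normalized schema: a fact table and one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales(store_id INTEGER, y REAL);
CREATE TABLE stores(store_id INTEGER PRIMARY KEY, region TEXT);
INSERT INTO sales VALUES (1, 10), (1, 12), (2, 20), (3, 22), (3, 26);
INSERT INTO stores VALUES (1, 'east'), (2, 'west'), (3, 'west');
""")

# Baseline: aggregate over the full join result.
naive = conn.execute("""
SELECT region, COUNT(*), SUM(y), SUM(y * y)
FROM sales JOIN stores USING (store_id)
GROUP BY region ORDER BY region
""").fetchall()

# Factorized: pre-aggregate the fact table per join key, then join.
factorized = conn.execute("""
WITH msg AS (
  SELECT store_id, COUNT(*) AS c, SUM(y) AS s, SUM(y * y) AS q
  FROM sales GROUP BY store_id
)
SELECT region, SUM(c), SUM(s), SUM(q)
FROM msg JOIN stores USING (store_id)
GROUP BY region ORDER BY region
""").fetchall()

# Score the candidate split region = 'east' versus the rest.
groups = {region: (c, s, q) for region, c, s, q in factorized}
total = tuple(sum(vals) for vals in zip(*groups.values()))
reduction = sse(*total) - sum(sse(*g) for g in groups.values())
print(naive == factorized, round(reduction, 3))
```

Both queries return the same per-region triples, but the second never builds the row-level join, which is the property that lets the approach scale with join graph size. A real system would pass such messages along a full join graph and evaluate many candidate splits per query.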
