Random Forest (RF) is a popular tree-ensemble method for supervised learning, prized for its ease of use and flexibility. Online RF models must account for new training data to maintain model accuracy. This is particularly important in applications where data is generated periodically and sequentially over time in data streams, such as autonomous driving systems and credit card payments. In this setting, periodically retraining the model on the accumulated old and new data is beneficial, as it fully captures possible drifts in the data distribution over time. However, this is impractical with state-of-the-art classical algorithms for RF, as their runtime scales linearly with the accumulated number of samples. We propose QC-Forest, a classical-quantum algorithm designed to time-efficiently retrain RF models in the streaming setting for multi-class classification and regression, achieving a runtime poly-logarithmic in the total number of accumulated samples. QC-Forest leverages Des-q, a quantum algorithm for single-tree construction and retraining proposed by Kumar et al., and extends it in two ways: it expands Des-q to multi-class classification, as the original proposal was limited to binary classes, and it introduces an exact classical method to replace an underlying quantum subroutine that incurs a finite error, while maintaining the same poly-logarithmic dependence. Finally, we show that QC-Forest achieves competitive accuracy in comparison to state-of-the-art RF methods on widely used benchmark datasets with up to 80,000 samples, while significantly speeding up model retraining.