Random Forest (RF) is a popular tree-ensemble method for supervised learning, prized for its ease of use and flexibility. Online RF models must account for new training data to maintain model accuracy. This is particularly important in applications where data is generated periodically and sequentially over time in data streams, such as autonomous driving systems and credit card payments. In this setting, periodically retraining the model on the accumulated old and new data is beneficial, as it fully captures possible drifts in the data distribution over time. However, this is impractical with state-of-the-art classical algorithms for RF, as their runtime scales linearly with the accumulated number of samples. We propose QC-Forest, a classical-quantum algorithm designed to time-efficiently retrain RF models in the streaming setting for multi-class classification and regression, achieving a runtime poly-logarithmic in the total number of accumulated samples. QC-Forest leverages Des-q, a quantum algorithm for single-tree construction and retraining proposed by Kumar et al., and extends it in two ways: it supports multi-class classification, whereas the original proposal was limited to binary classes, and it introduces an exact classical method to replace an underlying quantum subroutine that incurs a finite error, while maintaining the same poly-logarithmic dependence. Finally, we show that QC-Forest achieves accuracy competitive with state-of-the-art RF methods on widely used benchmark datasets with up to 80,000 samples, while significantly speeding up model retraining.