Evaluating the Efficacy of Instance Incremental vs. Batch Learning in Delayed Label Environments: An Empirical Study on Tabular Data Streaming for Fraud Detection

September 16, 2024

Kodjo Mawuena Amekoe, Mustapha Lebbah, Gregoire Jaffre, Hanene Azzag, Zaineb Chelly Dagdia

From arXiv, 20 pages

Real-world production scenarios for tabular learning typically involve evolving data streams, where data arrives continuously and its distribution may change over time. In such settings, most supervised learning studies in the literature favor instance incremental algorithms because of their ability to adapt to changes in the data distribution. Another significant reason for choosing these algorithms is that they \textit{avoid storing observations in memory}, which batch incremental settings commonly require. However, the design of instance incremental algorithms often assumes that labels are available immediately, which is an optimistic assumption. In many real-world scenarios, such as fraud detection or credit scoring, labels may be delayed. Consequently, batch incremental algorithms are widely used in many real-world tasks. This raises an important question: "In delayed settings, is instance incremental learning the best option regarding predictive performance and computational efficiency?" Unfortunately, this question has not been studied in depth, probably due to the scarcity of real datasets containing delay information. In this study, we conduct a comprehensive empirical evaluation and analysis of this question using a real-world fraud detection problem and commonly used generated datasets. Our findings indicate that instance incremental learning is not the superior option when state-of-the-art instance incremental models such as Adaptive Random Forest (ARF) are compared with batch learning models such as XGBoost. Additionally, when the interpretability of the learning system is taken into account, batch incremental solutions tend to be favored. Code: \url{https://github.com/anselmeamekoe/DelayedLabelStream}
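To make the delayed-label setting concrete, the following is a minimal sketch of a test-then-train loop in which each label becomes available only a fixed number of steps after its observation arrives. It is an illustration under stated assumptions, not the authors' evaluation protocol or code from their repository: the names `OnlineCentroid`, `delayed_prequential`, and `make_stream`, the fixed integer `delay`, and the toy two-class Gaussian stream are all hypothetical, and the nearest-centroid learner merely stands in for a real instance incremental model such as ARF.

```python
from collections import deque
import random


class OnlineCentroid:
    """Toy instance-incremental classifier (hypothetical): one running mean per class."""

    def __init__(self):
        self.sums = {}    # class label -> per-feature running sums
        self.counts = {}  # class label -> number of labeled examples seen

    def learn_one(self, x, y):
        s = self.sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        self.counts[y] = self.counts.get(y, 0) + 1

    def predict_one(self, x, default=0):
        if not self.counts:
            return default  # no label has arrived yet

        def sq_dist(y):
            n = self.counts[y]
            return sum((v - s / n) ** 2 for v, s in zip(x, self.sums[y]))

        return min(self.counts, key=sq_dist)


def delayed_prequential(stream, model, delay):
    """Predict on arrival; learn from each label only once its delay has elapsed."""
    pending = deque()  # entries: (release_time, x, y), ordered by release_time
    correct = total = 0
    for t, (x, y) in enumerate(stream):
        # Release every label whose delay has elapsed by time t.
        while pending and pending[0][0] <= t:
            _, x_old, y_old = pending.popleft()
            model.learn_one(x_old, y_old)
        correct += int(model.predict_one(x) == y)
        total += 1
        pending.append((t + delay, x, y))
    return correct / total if total else 0.0


def make_stream(n, seed=0):
    """Assumed toy data: two well-separated Gaussian classes in two dimensions."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n):
        y = rng.randint(0, 1)
        stream.append(([rng.gauss(3.0 * y, 1.0), rng.gauss(-3.0 * y, 1.0)], y))
    return stream


if __name__ == "__main__":
    data = make_stream(500)
    for d in (0, 50, 200):
        acc = delayed_prequential(data, OnlineCentroid(), delay=d)
        print(f"delay={d:3d}  accuracy={acc:.3f}")
```

With `delay=0` the loop reduces to the usual immediate-label test-then-train evaluation. With a larger delay the model keeps predicting from stale information until the queued labels are released, which is the situation the abstract describes for fraud detection and credit scoring.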
