Fraud detection aims to identify, monitor, and prevent potentially fraudulent activities in complex data. Recent advances in AI, especially machine learning, provide a new data-driven way to deal with fraud. From a methodological point of view, machine learning based fraud detection can be divided into two categories: conventional methods (e.g., decision trees and boosting) and deep learning. Both have significant limitations: conventional methods lack representation learning ability, and deep learning lacks interpretability. Furthermore, because detected fraud cases are rare, the associated data is usually imbalanced, which seriously degrades the performance of classification algorithms. In this paper, we propose deep boosting decision trees (DBDT), a novel approach to fraud detection based on gradient boosting and neural networks. To combine the advantages of conventional methods and deep learning, we first construct the soft decision tree (SDT), a decision-tree-structured model with neural networks as its nodes, and then ensemble SDTs using the idea of gradient boosting. In this way, we embed neural networks into gradient boosting to improve its representation learning capability while maintaining interpretability. To address the rarity of detected fraud cases, we further propose a compositional AUC maximization approach for the model training phase, which deals with data imbalance at the algorithm level. Extensive experiments on several real-life fraud detection datasets show that DBDT significantly improves detection performance while maintaining good interpretability. Our code is available at https://github.com/freshmanXB/DBDT.
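To make the idea of a decision tree with neural networks as its nodes more concrete, the following is a minimal sketch of the forward pass of one soft tree. It is an illustration under stated assumptions, not the DBDT implementation: the class name `SoftDecisionTree`, the use of a single linear layer with a sigmoid as each inner node, the scalar leaf values, and the random initialization are all hypothetical choices made here for brevity. Training, the boosting loop, and the compositional AUC objective are not shown.

```python
import numpy as np


def sigmoid(z):
    """Logistic function used as the soft routing gate."""
    return 1.0 / (1.0 + np.exp(-z))


class SoftDecisionTree:
    """Forward pass of a soft decision tree (illustrative sketch only).

    Assumption: each inner node is a single linear layer followed by a
    sigmoid, and each leaf holds one scalar score. The real SDT nodes in
    DBDT may be richer networks.
    """

    def __init__(self, n_features, depth=2, seed=0):
        rng = np.random.default_rng(seed)
        self.depth = depth
        n_inner = 2 ** depth - 1   # inner nodes, stored in heap order
        n_leaves = 2 ** depth
        self.W = rng.normal(scale=0.1, size=(n_inner, n_features))
        self.b = np.zeros(n_inner)
        self.leaf_values = rng.normal(scale=0.1, size=n_leaves)

    def leaf_probabilities(self, X):
        """Return an (n_samples, n_leaves) matrix of path probabilities."""
        # gate[i, k] = probability that sample i goes right at inner node k
        gate = sigmoid(X @ self.W.T + self.b)
        n = X.shape[0]
        probs = np.ones((n, 1))    # every sample reaches the root
        for level in range(self.depth):
            start = 2 ** level - 1
            g = gate[:, start:start + 2 ** level]
            left = probs * (1.0 - g)
            right = probs * g
            # Interleave so the children of each parent stay adjacent.
            probs = np.stack([left, right], axis=2).reshape(n, -1)
        return probs

    def predict(self, X):
        """Score = expectation of the leaf values under the soft routing."""
        return self.leaf_probabilities(X) @ self.leaf_values


def ensemble_score(trees, X, learning_rate=0.1):
    """Additive, boosting-style combination of several soft trees."""
    return learning_rate * sum(tree.predict(X) for tree in trees)
```

Because every sample reaches every leaf with some probability, the output is differentiable in the node parameters, which is what allows such trees to be trained by gradient methods. The per-node gate values also give a readable routing path for each sample, which is one plausible source of the interpretability the abstract refers to.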