ATM Fraud Detection using Streaming Data Analytics

Gaining the trust and confidence of customers is essential to the growth and success of financial institutions and organizations. Of late, the financial industry has been significantly affected by numerous instances of fraudulent activity. Further, because large volumes of data are generated, the underlying framework must be scalable and meet real-time needs. To address these issues, in this study we proposed ATM fraud detection in both static and streaming contexts. In the static context, we investigated a parallel and scalable machine learning framework for ATM fraud detection that is built on Spark and trained with a variety of machine learning (ML) models, including Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Gradient Boosting Tree (GBT), and Multi-Layer Perceptron (MLP). We also employed several balancing techniques, such as the Synthetic Minority Oversampling Technique (SMOTE) and its variants and Generative Adversarial Networks (GAN), to address the class imbalance caused by the rarity of fraud cases in the dataset. In the streaming context, we proposed a sliding-window-based method that collects the ATM transactions performed within a specified time interval and then uses them to train several ML models, including NB, RF, DT, and K-Nearest Neighbour (KNN). We selected these models for their lower model complexity and quicker response time. In both contexts, RF turned out to be the best model: it obtained the best mean AUC of 0.975 in the static context and a mean AUC of 0.910 in the streaming context. RF was also shown empirically to be statistically significantly better than the next-best performing models.
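
The sliding-window collection step described above can be illustrated with a minimal sketch. It is not the implementation used in the study: the `Transaction` record, its field names, and the `window_seconds` and `slide_seconds` parameters are hypothetical, and the sketch assumes that transactions arrive ordered by timestamp and that each window covers the half-open interval [start, end).

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Transaction:
    """Hypothetical ATM transaction record (field names are assumptions)."""
    timestamp: float              # seconds since an arbitrary epoch
    features: Tuple[float, ...]   # numeric features of the transaction
    is_fraud: bool                # label, known for training data


def _evict_older_than(buffer: Deque[Transaction], start: float) -> None:
    # Drop transactions that fall before the start of the current window.
    while buffer and buffer[0].timestamp < start:
        buffer.popleft()


def sliding_windows(
    stream: Iterable[Transaction],
    window_seconds: float,
    slide_seconds: float,
) -> Iterator[List[Transaction]]:
    """Yield the transactions inside each time window as the stream advances.

    Assumes the stream is ordered by timestamp. Empty windows are skipped.
    """
    buffer: Deque[Transaction] = deque()
    window_end = None
    for tx in stream:
        if window_end is None:
            # The first window starts at the first transaction seen.
            window_end = tx.timestamp + window_seconds
        # Emit every window that has closed before this transaction.
        while tx.timestamp >= window_end:
            _evict_older_than(buffer, window_end - window_seconds)
            if buffer:
                yield list(buffer)
            window_end += slide_seconds
        buffer.append(tx)
    # Emit the final, possibly partial, window at the end of the stream.
    if window_end is not None:
        _evict_older_than(buffer, window_end - window_seconds)
        if buffer:
            yield list(buffer)
```

Each yielded batch could then be turned into a feature matrix and label vector and used to fit a lightweight classifier, for example scikit-learn's `RandomForestClassifier` via its `fit(X, y)` method. That pairing is an assumption made for illustration and is not taken from the study.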
