In this paper, we present "Graph Feature Preprocessor", a software library for detecting typical money laundering and fraud patterns in financial transaction graphs in real time. These patterns are used to produce a rich set of transaction features for downstream machine learning training and inference tasks such as money laundering detection. We show that our enriched transaction features dramatically improve the prediction accuracy of gradient-boosting-based machine learning models. Our library exploits multicore parallelism, maintains a dynamic in-memory graph, and efficiently mines subgraph patterns in the incoming transaction stream, which allows it to operate in a streaming manner. We evaluate our library using highly imbalanced synthetic anti-money laundering (AML) and real-life Ethereum phishing datasets. In these datasets, the proportion of illicit transactions is very small, which makes the learning process challenging. Our solution, which combines our Graph Feature Preprocessor with gradient-boosting-based machine learning models, detects these illicit transactions with higher minority-class F1 scores than standard graph neural networks. In addition, the end-to-end throughput of our solution executed on a multicore CPU exceeds that of the graph neural network baselines executed on a powerful V100 GPU. Overall, the combination of high accuracy, high throughput, and low latency demonstrates the practical value of our library in real-world applications. Graph Feature Preprocessor has been integrated into IBM mainframe software products, namely "IBM Cloud Pak for Data on Z" and "AI Toolkit for IBM Z and LinuxONE".
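To make the streaming idea concrete, the following is a minimal single-threaded sketch of how per-transaction graph features could be derived from a dynamic in-memory graph. It is not the Graph Feature Preprocessor API: the class `StreamingGraphFeatures`, the `Transaction` record, the sliding time window, and the choice of patterns (fan-out, fan-in, and cycles of length 2 and 3) are all assumptions made for illustration, and the sketch assumes transactions arrive in timestamp order.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """Hypothetical transaction record (not taken from the library)."""
    src: str
    dst: str
    amount: float
    timestamp: int


class StreamingGraphFeatures:
    """Illustrative dynamic graph that keeps only edges inside a time window."""

    def __init__(self, window: int):
        self.window = window
        self.edges = deque()  # live transactions in arrival order
        # src -> dst -> number of live edges, and the reverse index
        self.out_adj = defaultdict(lambda: defaultdict(int))
        self.in_adj = defaultdict(lambda: defaultdict(int))

    @staticmethod
    def _decrement(adj, a, b):
        adj[a][b] -= 1
        if adj[a][b] == 0:
            del adj[a][b]
        if not adj[a]:
            del adj[a]

    def _evict(self, now: int):
        # Drop edges older than the window (assumes in-order timestamps).
        while self.edges and self.edges[0].timestamp < now - self.window:
            old = self.edges.popleft()
            self._decrement(self.out_adj, old.src, old.dst)
            self._decrement(self.in_adj, old.dst, old.src)

    def add(self, tx: Transaction) -> dict:
        """Insert a transaction and return its graph-based features."""
        self._evict(tx.timestamp)
        self.edges.append(tx)
        self.out_adj[tx.src][tx.dst] += 1
        self.in_adj[tx.dst][tx.src] += 1

        # Successors of the receiver; .get avoids creating empty entries.
        dst_out = self.out_adj.get(tx.dst, {})
        # Cycle of length 2: the receiver has already paid the sender.
        cycle2 = int(tx.src != tx.dst and tx.src in dst_out)
        # Cycles of length 3: receiver -> x -> sender closes the loop.
        cycle3 = sum(
            1
            for x in dst_out
            if x not in (tx.src, tx.dst) and tx.src in self.out_adj.get(x, {})
        )
        return {
            "amount": tx.amount,
            "fan_out": len(self.out_adj[tx.src]),  # distinct receivers of src
            "fan_in": len(self.in_adj[tx.dst]),    # distinct senders to dst
            "cycle2": cycle2,
            "cycle3": cycle3,
        }


if __name__ == "__main__":
    graph = StreamingGraphFeatures(window=10)
    stream = [
        Transaction("A", "B", 100.0, 0),
        Transaction("B", "C", 95.0, 1),
        Transaction("C", "A", 90.0, 2),
    ]
    for tx in stream:
        print(tx, graph.add(tx))
```

Each returned dictionary could be appended to the raw transaction attributes and passed to a gradient-boosting model as an enriched feature row. A production system along the lines described above would additionally parallelize the pattern search across cores and support a wider set of subgraph patterns.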