Subgraph representation learning is a technique for analyzing local structures (or shapes) within complex networks. Enabled by recent developments in scalable Graph Neural Networks (GNNs), this approach encodes relational information at the level of a subgroup (multiple connected nodes) rather than at the level of a single node. We posit that certain domain applications, such as anti-money laundering (AML), are inherently subgraph problems, and that mainstream graph techniques have been operating at a suboptimal level of abstraction. This is due in part to the scarcity of annotated datasets of real-world size and complexity, as well as the lack of software tools for managing subgraph GNN workflows at scale. To enable work on fundamental algorithms as well as domain applications in AML and beyond, we introduce Elliptic2, a large graph dataset containing 122K labeled subgraphs of Bitcoin clusters within a background graph consisting of 49M node clusters and 196M edge transactions. The dataset provides subgraphs known to be linked to illicit activity, which can be used to learn the set of "shapes" that money laundering exhibits in cryptocurrency and to classify new criminal activity accurately. Along with the dataset, we share our graph techniques, software tooling, promising early experimental results, and new domain insights already gleaned from this approach. Taken together, these contributions show immediate practical value and the potential for a new standard in anti-money laundering and forensic analytics in cryptocurrencies and other financial networks.
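To make the distinction between node-level and subgraph-level encoding concrete, the following is a minimal sketch in plain Python. It is not the method or tooling described in the abstract: the toy graph, the feature values, and the helper names `neighbor_mean` and `subgraph_embedding` are all assumptions made for illustration. It runs one round of mean neighbor aggregation (a simplified stand-in for a GNN layer) and then mean-pools the resulting node embeddings over the members of one labeled subgraph, so that the object being classified is the subgraph rather than any single node.

```python
from collections import defaultdict


def neighbor_mean(features, edges):
    """One round of mean aggregation over each node and its neighbors.

    A simplified, untrained stand-in for a single GNN layer (assumption).
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    updated = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in sorted(neighbors[node])]
        updated[node] = [sum(col) / len(group) for col in zip(*group)]
    return updated


def subgraph_embedding(node_embeddings, subgraph_nodes):
    """Mean-pool node embeddings over the members of one subgraph."""
    vecs = [node_embeddings[n] for n in subgraph_nodes]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# Toy background graph (hypothetical data): nodes stand in for clusters,
# edges stand in for transactions between them.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
features = {
    0: [1.0, 0.0],
    1: [0.0, 1.0],
    2: [1.0, 1.0],
    3: [0.0, 0.0],
    4: [2.0, 2.0],
}

node_embeddings = neighbor_mean(features, edges)

# A labeled subgraph is a set of connected nodes inside the background graph.
labeled_subgraph = [0, 1, 2]
embedding = subgraph_embedding(node_embeddings, labeled_subgraph)
print(embedding)
```

In a real pipeline the pooled vector would feed a trained classifier, and the aggregation step would be a learned GNN layer; the sketch only shows where the subgraph-level abstraction enters.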