Ethereum's rapid ecosystem expansion and transaction anonymity have triggered a surge in malicious activity. Current detection mechanisms fall into three technical strands: expert-defined features, graph embeddings, and sequential transaction patterns, which together cover the full feature set of Ethereum's native data layer. Yet because no mechanism integrates these paradigms, practitioners must give up at least one of sequential context awareness, structured fund-flow patterns, or human-curated feature insights in their solutions. To bridge this gap, we propose KGBERT4Eth, a feature-complete pre-training encoder that combines two key components: (1) a Transaction Semantic Extractor, in which we train an enhanced Transaction Language Model (TLM) to learn contextual semantic representations from conceptualized transaction records, and (2) a Transaction Knowledge Graph (TKG) that incorporates expert-curated domain knowledge into graph node embeddings to capture fund-flow patterns and human-curated feature insights. We jointly optimize the pre-training objectives of both components to fuse these complementary features, generating feature-complete embeddings. To emphasize rare anomalous transactions, we design a biased masking prediction task for the TLM that focuses on statistical outliers, while the TKG employs link prediction to learn latent transaction relationships and aggregate knowledge. Furthermore, we propose a mask-invariant attention coordination module to ensure stable dynamic information exchange between the TLM and the TKG during pre-training. KGBERT4Eth significantly outperforms state-of-the-art baselines in both phishing account detection and de-anonymization tasks, achieving absolute F1-score improvements of 8-16% on three phishing detection benchmarks and 6-26% on four de-anonymization datasets.
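The abstract does not say how the biased masking task selects tokens, so the following is only a minimal sketch of one way such a scheme could look, not the authors' implementation. It assumes that "statistical outlier" can be approximated by token rarity (a smoothed negative log frequency), that about 15% of tokens are masked as in standard masked language modeling, and that positions are drawn by weighted sampling without replacement. The names `rarity_weight` and `biased_mask`, the `[MASK]` token, and the toy token vocabulary are all hypothetical.

```python
import math
import random
from collections import Counter


def rarity_weight(token, corpus_counts):
    """Hypothetical outlier score: smoothed negative log frequency (rarer -> larger)."""
    total = sum(corpus_counts.values())
    vocab = len(corpus_counts) + 1  # one extra slot for unseen tokens
    p = (corpus_counts.get(token, 0) + 1) / (total + vocab + 1)
    return -math.log(p)  # strictly positive because p < 1


def biased_mask(tokens, corpus_counts, mask_ratio=0.15, mask_token="[MASK]", seed=None):
    """Mask about mask_ratio of the tokens, preferring rare ones.

    Returns (masked_tokens, labels), where labels maps each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(mask_ratio * len(tokens))))
    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # draw key = log(u) / w per position and keep the n_mask largest keys.
    keys = []
    for i, tok in enumerate(tokens):
        u = 1.0 - rng.random()  # in (0, 1], so log(u) is defined
        keys.append((math.log(u) / rarity_weight(tok, corpus_counts), i))
    chosen = sorted(i for _, i in sorted(keys, reverse=True)[:n_mask])
    masked = list(tokens)
    labels = {}
    for i in chosen:
        labels[i] = masked[i]
        masked[i] = mask_token
    return masked, labels


# Toy usage: nine common tokens and one rare token (assumed vocabulary).
counts = Counter({"transfer": 1000, "huge_value": 1})
sequence = ["transfer"] * 9 + ["huge_value"]
masked, labels = biased_mask(sequence, counts, mask_ratio=0.1, seed=0)
print(masked, labels)
```

With uniform masking, the rare token in this toy sequence would be selected about one time in ten; under the rarity weighting assumed here it is selected in the large majority of draws, which is the qualitative behavior the abstract attributes to biased masking.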