Named Entity Recognition for Payment Data Using NLP

Named Entity Recognition (NER) has emerged as a critical component in automating financial transaction processing, particularly for extracting structured information from unstructured payment data. This paper presents a comprehensive analysis of state-of-the-art NER algorithms applied to payment data extraction, including Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory with CRF (BiLSTM-CRF), and transformer-based models such as BERT and FinBERT. We conduct extensive experiments on a dataset of 50,000 annotated payment transactions spanning multiple payment formats, including SWIFT MT103, ISO 20022, and domestic payment systems. Our results show that fine-tuned BERT models achieve an F1-score of 94.2% for entity extraction, outperforming traditional CRF-based approaches by 12.8 percentage points. Furthermore, we introduce PaymentBERT, a novel hybrid architecture that combines domain-specific financial embeddings with contextual representations and achieves state-of-the-art performance with a 95.7% F1-score while maintaining real-time processing capabilities. We also provide a detailed analysis of cross-format generalization, ablation studies, and deployment considerations. This research offers practical insights for financial institutions implementing automated sanctions screening, anti-money laundering (AML) compliance, and payment processing systems.
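The abstract reports F1-scores for entity extraction without stating how they are computed. A minimal sketch of one common convention, entity-level F1 over BIO-tagged sequences with exact span matching, is shown below. Whether the paper uses this convention is an assumption, and the label names `NAME` and `AMT` are hypothetical, chosen only for illustration.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    A stray I- tag with no matching open entity is ignored.
    """
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: a predicted span counts only if type and boundaries match."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction finds the name but misses the amount.
gold = ["O", "B-NAME", "I-NAME", "O", "B-AMT"]
pred = ["O", "B-NAME", "I-NAME", "O", "O"]
print(round(entity_f1(gold, pred), 4))
```

In this example precision is 1.0 and recall is 0.5, so the score is 2/3. Token-level scoring would give a different number for the same prediction, which is why the scoring convention matters when comparing reported results.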

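Conditional Random Fields

A conditional random field (CRF, or CRFs) is a discriminative probabilistic model and a type of random field, commonly used to label or analyze sequential data such as natural-language text or biological sequences. Like a Markov random field, a CRF is an undirected graphical model: the vertices represent random variables, and the edges between vertices represent dependencies between those variables. In a CRF, the distribution of the random variable Y is a conditional probability, conditioned on the observed random variable X. In principle the graph layout of a CRF can be chosen arbitrarily, but the most common layout is the linear chain, for which efficient algorithms exist for training, inference, and decoding.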
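The efficient decoding available for the linear-chain layout is typically the Viterbi algorithm. The sketch below is a minimal illustration in plain Python, assuming per-position emission scores and a label-to-label transition score matrix are already available; the three labels (`O`, `B-NAME`, `I-NAME`) and all score values are hypothetical.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a linear-chain model.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    if not emissions:
        return []
    n_labels = len(emissions[0])
    # best[j] = best score of any path ending in label j at the current position
    best = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(n_labels):
            # choose the previous label that maximizes the path score into j
            prev = max(range(n_labels), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # trace back from the best final label
    last = max(range(n_labels), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path


# Hypothetical labels: 0 = O, 1 = B-NAME, 2 = I-NAME
emissions = [
    [2.0, 0.5, 0.0],  # token 1: most likely O
    [0.0, 1.0, 1.2],  # token 2: I-NAME scores slightly above B-NAME
    [0.0, 0.2, 1.5],  # token 3: most likely I-NAME
]
transitions = [
    [0.0, 0.0, -1e4],  # O -> I-NAME is heavily penalized (invalid in BIO)
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 0.5],
]
print(viterbi_decode(emissions, transitions))  # [0, 1, 2]
```

Picking the best label at each position independently would yield `O, I-NAME, I-NAME`, which is not a valid BIO sequence. Because the transition scores are part of the path score, decoding over the whole chain returns `O, B-NAME, I-NAME` instead, at a cost that grows linearly with sequence length and quadratically with the number of labels.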
