Named Entity Recognition (NER) is a core component of Natural Language Processing, with applications in information extraction and conversational AI. However, domain-specific NER for low-resource languages is hampered by limited annotated data and heterogeneous label sets. This study addresses both issues by proposing a hybrid neurosymbolic framework that integrates rule-based processing with deep learning models for Vietnamese NER. The core idea is a two-stage pipeline: first, a rule-based component reduces label complexity by grouping relational and special categories; second, pre-trained language models are fine-tuned on the reduced label set for high-precision extraction. A post-processing module then restores the fine-grained labels, preserving the expressiveness needed at the application level. To mitigate data scarcity, a scalable data augmentation strategy leveraging Large Language Models (LLMs) is introduced to expand the label set without full re-annotation, which is a significant novelty of this work. The method was evaluated on five domain-specific datasets covering areas such as logistics, wildlife, and healthcare. Experimental results show gains of 3 to 24 F1 points over strong RoBERTa-based baselines. Specifically, the proposed system achieved F1 scores of 90 percent on Customer Service, up from 83 percent; 84 percent on GAM, up from 73 percent; 83 percent on AI Fluent, up from 80 percent; 94 percent on PhoNER_Covid19, up from 91 percent; and 60 percent on Rare Wildlife, up from 36 percent. These findings confirm that the hybrid approach effectively captures the linguistic complexity of Vietnamese and the contextual nuances of specialized domains, offering a robust contribution to low-resource NER research.
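The symbolic steps that surround the fine-tuned model (grouping fine-grained labels before training and restoring them after prediction) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract does not list the label inventory or the rules, so the labels `SENDER_NAME` and `RECEIVER_NAME`, the cue words, the `window` parameter, and the functions `coarsen` and `restore` are hypothetical. The neural tagger itself is omitted; the sketch simply feeds the coarse tags straight back into the restoration step.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical fine-grained labels and their coarse groups (not from the source).
FINE_TO_COARSE: Dict[str, str] = {
    "SENDER_NAME": "PERSON",
    "RECEIVER_NAME": "PERSON",
}

# Hypothetical cue words that signal a role ("gửi" = send, "nhận" = receive).
ROLE_CUES: Dict[str, str] = {"gửi": "SENDER", "nhận": "RECEIVER"}

# Inverse lookup used after prediction: (coarse label, role) -> fine label.
RESTORE: Dict[Tuple[str, str], str] = {
    ("PERSON", "SENDER"): "SENDER_NAME",
    ("PERSON", "RECEIVER"): "RECEIVER_NAME",
}


def coarsen(tags: List[str]) -> List[str]:
    """Stage-one rule: map fine-grained BIO tags to their coarse group."""
    out = []
    for tag in tags:
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        out.append(f"{prefix}-{FINE_TO_COARSE.get(label, label)}")
    return out


def restore(tokens: List[str], tags: List[str], window: int = 3) -> List[str]:
    """Post-processing rule: recover the fine label from a nearby cue word."""
    out = []
    role: Optional[str] = None
    for i, tag in enumerate(tags):
        if tag == "O":
            role = None  # leaving an entity resets the role
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "B":
            # Search the preceding tokens, nearest first, for a role cue.
            role = None
            for tok in reversed(tokens[max(0, i - window):i]):
                if tok.lower() in ROLE_CUES:
                    role = ROLE_CUES[tok.lower()]
                    break
        fine = RESTORE.get((label, role), label) if role else label
        out.append(f"{prefix}-{fine}")
    return out


tokens = ["người", "gửi", "Nam", "người", "nhận", "Lan"]
fine_tags = ["O", "O", "B-SENDER_NAME", "O", "O", "B-RECEIVER_NAME"]
coarse_tags = coarsen(fine_tags)        # what the tagger would be trained on
restored = restore(tokens, coarse_tags)  # what the application would receive
print(coarse_tags)
print(restored)
```

In this toy setting the tagger only has to learn one `PERSON` class instead of two role-specific classes, and the role is recovered afterwards from context. A real system would presumably need richer rules than a fixed cue-word window.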