Citation field learning segments a citation string into fields of interest such as author, title, and venue. Extracting these fields is crucial for applications such as citation indexing and researcher profile analysis. User-generated resources such as academic homepages and curricula vitae provide rich citation field information. However, extracting fields from these resources is challenging because of inconsistent citation styles, incomplete sentence syntax, and insufficient training data. To address these challenges, we propose a novel algorithm, CIFAL (citation field learning by anchor learning), to improve citation field learning performance. CIFAL leverages anchor learning, which is model-agnostic and can be applied to any pre-trained language model, to help capture citation patterns from data in different citation styles. Experiments demonstrate that CIFAL outperforms state-of-the-art methods in citation field learning, achieving a 2.83% improvement in field-level F1 score. Extensive analysis of the results further confirms the effectiveness of CIFAL both quantitatively and qualitatively.
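To make the task and the reported metric concrete, the sketch below treats citation field learning as token labeling and scores predictions at the field level. It is only an illustration of the problem setup, not the CIFAL algorithm: the function names `fields_from_labels` and `field_level_f1`, the label set, the toy citation, and the exact-match definition of field-level F1 are all assumptions made here, and the paper may define the metric differently.

```python
from itertools import groupby


def fields_from_labels(tokens, labels):
    """Group consecutive tokens sharing a label into (label, start, end, text) spans.

    Tokens labeled "O" (assumed here to mean "outside any field") are skipped.
    """
    spans = []
    index = 0
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        words = [token for token, _ in group]
        if label != "O":
            spans.append((label, index, index + len(words), " ".join(words)))
        index += len(words)
    return spans


def field_level_f1(gold_spans, predicted_spans):
    """Exact-match F1: a predicted field counts only if label and boundaries match.

    This exact-match definition is an assumption for illustration.
    """
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented toy citation, tokenized by hand; not taken from any dataset.
tokens = ["J.", "Smith", ".", "Parsing", "citations", ".", "ACL", "2020"]
gold_labels = ["author", "author", "O", "title", "title", "O", "venue", "year"]
pred_labels = ["author", "author", "O", "title", "title", "O", "venue", "venue"]

gold_spans = fields_from_labels(tokens, gold_labels)
pred_spans = fields_from_labels(tokens, pred_labels)

for span in gold_spans:
    print(span)
print(f"field-level F1: {field_level_f1(gold_spans, pred_spans):.4f}")
```

In this toy case the prediction merges the venue and year into one field, so only the author and title fields match exactly. That is why a field-level score penalizes boundary errors more heavily than a token-level score would.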