Citation field learning segments a citation string into fields of interest such as author, title, and venue. Extracting these fields is crucial for applications such as citation indexing and researcher profile analysis. User-generated resources such as academic homepages and curricula vitae provide rich citation field information. However, extracting fields from these resources is challenging because of inconsistent citation styles, incomplete sentence syntax, and insufficient training data. To address these challenges, we propose a novel algorithm, CIFAL (citation field learning by anchor learning), to improve citation field learning performance. CIFAL leverages anchor learning, which is model-agnostic and can be applied to any pre-trained language model, to help capture citation patterns from data in different citation styles. Experiments demonstrate that CIFAL outperforms state-of-the-art methods in citation field learning, achieving a 2.68% improvement in field-level F1-score. Extensive analysis of the results further confirms the effectiveness of CIFAL both quantitatively and qualitatively.