Disambiguating scholars with identical names is essential for accurate authorship assignment and robust large-scale scientometric research. Existing methods are often designed for Latin-script metadata and perform poorly on Chinese names. In international publications, Chinese names typically appear as Romanized Pinyin, which is highly ambiguous because a single Pinyin string can map to multiple distinct characters. Chinese characters, in contrast, reduce but do not eliminate this ambiguity, and they are rarely available in international records. To address both challenges, we propose a rule-based disambiguation framework that integrates co-authorship networks, citation networks, author affiliations, and content similarity. We apply this framework to 65,241 physics papers from the China National Knowledge Infrastructure (CNKI), spanning more than 70 years. On a human-annotated sample of 80 name pairs, our method achieves F1 scores of 0.88 for Pinyin names and 0.89 for character-based names, outperforming two baseline approaches, with improvements driven primarily by higher recall. The comparable performance across both writing systems shows that our approach is script-agnostic, enabling reliable large-scale scientometric analyses.
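To make the idea of a rule-based comparison concrete, the following is a minimal sketch of how two author mentions with the same name might be compared using the four evidence types named above, together with the pairwise F1 measure used for evaluation. It is not the framework's actual rule set: the `AuthorRecord` fields, the order of the rules, the use of keyword overlap as a stand-in for content similarity, and the `content_threshold` value are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorRecord:
    """One author mention on one paper (all field names are hypothetical)."""
    name: str                                      # Pinyin or character form
    paper_id: str
    coauthors: set = field(default_factory=set)    # names of co-authors
    affiliation: str = ""
    cited_ids: set = field(default_factory=set)    # papers this paper cites
    keywords: set = field(default_factory=set)     # crude proxy for content


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_person(r1, r2, content_threshold=0.5):
    """Illustrative rules: any single matching signal links the two mentions.

    The threshold of 0.5 is an assumed value, not one taken from the paper.
    """
    if r1.name != r2.name:
        return False
    # Rule 1: shared co-author.
    if r1.coauthors & r2.coauthors:
        return True
    # Rule 2: one paper cites the other.
    if r1.paper_id in r2.cited_ids or r2.paper_id in r1.cited_ids:
        return True
    # Rule 3: identical, non-empty affiliation string.
    if r1.affiliation and r1.affiliation == r2.affiliation:
        return True
    # Rule 4: sufficiently similar content.
    return jaccard(r1.keywords, r2.keywords) >= content_threshold


def f1_score(tp, fp, fn):
    """F1 from pair counts: true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


a = AuthorRecord("Wang Wei", "p1", coauthors={"Li Na"},
                 keywords={"superconductivity", "thin films"})
b = AuthorRecord("Wang Wei", "p2", coauthors={"Li Na", "Zhang Min"})
c = AuthorRecord("Wang Wei", "p3", coauthors={"Chen Jie"}, keywords={"optics"})

print(same_person(a, b))  # linked by the shared co-author
print(same_person(a, c))  # no signal matches
```

Because every rule here only ever adds links, a scheme of this shape tends to favor recall. The same function applies unchanged whether `name` holds a Pinyin string or Chinese characters, since the name is used only for an equality check.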