Name disambiguation, a fundamental problem in online academic systems, faces growing challenges as the number of research papers increases. For example, on AMiner, an online academic search platform, about 10% of names are shared by more than 100 authors. Existing research cannot fully address such hard real-world cases because its algorithms are built on small-scale or low-quality datasets. The development of effective algorithms is further hampered by the variety of tasks and evaluation protocols designed on top of these diverse datasets. To this end, we present WhoIsWho, which comprises a large-scale benchmark of over 1,000,000 papers built using an interactive annotation process, a regularly updated leaderboard with comprehensive tasks, and an easy-to-use toolkit that encapsulates the entire pipeline along with the most powerful features and baseline models for tackling the tasks. Our strong baseline has already been deployed online in the AMiner system, where it handles the daily assignment of arXiv papers. The documentation and regularly updated leaderboards are publicly available at http://whoiswho.biendata.xyz/.