Scientific abstracts are increasingly used as primary data in computational metascience research, yet the quality of these abstracts in widely used bibliographic databases has not been systematically examined. We assess the integrity of 10,000 randomly sampled English-language journal abstracts from OpenAlex using a two-stage annotation protocol combining human expert review and large language model classification. We identify seven distinct failure modes and find that 12\% of abstracts have integrity issues, with insufficient content and misplaced metadata being the most prevalent. We discuss implications for downstream research and describe a forthcoming community portal to support collective annotation efforts.