Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information (such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata) that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, that it alters the similarity rankings of cleaned abstracts, and that it improves the information content of standard-length embeddings.