December 30, 2025

Cleaning English Abstracts of Scientific Publications


Michael E. Rose, Nils A. Herrmann, Sebastian Erhardt

From arXiv; 2 tables, 2 figures

Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information (such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata) that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, that it alters the similarity rankings of cleaned abstracts, and that it improves the information content of standard-length embeddings.
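To make the distortion concrete, the following is a minimal sketch of how shared clutter can inflate the similarity between two unrelated abstracts. It is not the authors' model: it uses a plain bag-of-words cosine similarity as a stand-in for textual embeddings, and the two abstracts, the copyright line, and the helper names `bag_of_words` and `cosine` are all hypothetical, invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical abstracts on unrelated topics (invented for this sketch).
abstract_a = "We study protein folding with molecular dynamics simulations."
abstract_b = "We estimate the effect of trade tariffs on regional employment."

# Hypothetical publisher clutter appended to both abstracts.
clutter = " Copyright 2020 Example Publisher. All rights reserved."

sim_clean = cosine(bag_of_words(abstract_a), bag_of_words(abstract_b))
sim_cluttered = cosine(
    bag_of_words(abstract_a + clutter),
    bag_of_words(abstract_b + clutter),
)

print(f"clean: {sim_clean:.3f}  cluttered: {sim_cluttered:.3f}")
```

In this toy setup the cluttered pair scores higher than the clean pair only because both texts carry the same copyright line, which is the kind of spurious signal that removing clutter is meant to eliminate.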

