CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data - 专知论文

会员服务 ·

0

语言模型化 · 可理解性 · MoDELS · Extensibility · WEB ·

2023 年 4 月 28 日

CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data

翻译：CCpdf：从网络爬取数据构建高质量视觉丰富文档语料库

Michał Turski,Tomasz Stanisławek,Karol Kaczmarek,Paweł Dyda,Filip Graliński

from arxiv, Accepted at ICDAR 2023

In recent years, the field of document understanding has progressed a lot. A significant part of this progress has been possible thanks to the use of language models pretrained on large amounts of documents. However, pretraining corpora used in the domain of document understanding are single domain, monolingual, or nonpublic. Our goal in this paper is to propose an efficient pipeline for creating a big-scale, diverse, multilingual corpus of PDF files from all over the Internet using Common Crawl, as PDF files are the most canonical types of documents as considered in document understanding. We analysed extensively all of the steps of the pipeline and proposed a solution which is a trade-off between data quality and processing time. We also share a CCpdf corpus in a form or an index of PDF files along with a script for downloading them, which produces a collection useful for language model pretraining. The dataset and tools published with this paper offer researchers the opportunity to develop even better multilingual language models.

翻译：近年来，文档理解领域取得了显著进展。这一进展很大程度上归功于在大量文档上预训练的语言模型的应用。然而，当前文档理解领域使用的预训练语料库存在领域单一、语言单一或非公开的问题。本文的目标是提出一种高效的流水线方法，利用Common Crawl从互联网上构建大规模、多样化、多语种的PDF文件语料库——鉴于PDF文件是文档理解领域中最规范的文档类型。我们对流水线的所有步骤进行了深入分析，并提出了一种在数据质量与处理时间之间取得平衡的解决方案。同时，我们以PDF文件索引形式共享CCpdf语料库，并附带下载脚本，所得数据集可用于语言模型预训练。本文发布的数据集和工具将为研究者开发更优质的多语种语言模型提供支持。

0

相关内容

语言模型化

语言模型化

【超赞的#C++#速查&信息图】“hacking c++ - Cheat Sheets & Infographics”

【超赞的#C++#速查&信息图】“hacking c++ - Cheat Sheets & Infographics”

专知会员服务

30+阅读 · 2022年3月8日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

【下载】Python自然语言处理实战书籍和代码《Natural Language Processing in Action》

【下载】Python自然语言处理实战书籍和代码《Natural Language Processing in Action》

专知会员服务

80+阅读 · 2019年10月27日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

80+阅读 · 2019年10月10日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

深度自进化聚类：Deep Self-Evolution Clustering

深度自进化聚类：Deep Self-Evolution Clustering

我爱读PAMI

15+阅读 · 2019年4月13日

NLP 2018 Highlights：2018自然语言处理技术亮点汇总

NLP 2018 Highlights：2018自然语言处理技术亮点汇总

AINLP

10+阅读 · 2019年2月9日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

专知

13+阅读 · 2018年6月24日

【推荐】自然语言处理（NLP）指南

【推荐】自然语言处理（NLP）指南

机器学习研究会

35+阅读 · 2017年11月17日

【推荐】MXNet深度情感分析实战

【推荐】MXNet深度情感分析实战

机器学习研究会

16+阅读 · 2017年10月4日

半线性广义Tricomi方程Cauchy问题解的生命跨度估计研究

国家自然科学基金

0+阅读 · 2017年12月31日

SMAD2调控ERK通路干预M2巨噬细胞活化在糖尿病肾病小鼠肾脏纤维化中的作用及机制

国家自然科学基金

0+阅读 · 2015年12月31日

HEV感染致脑组织损伤的机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

Shh信号通路对CCL2调控在哮喘发生中的作用和机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

燃煤电厂烟气汞及典型重金属排放和脱除机理研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于时频二维训练信息的高谱效多天线TFT-OFDM技术研究

国家自然科学基金

1+阅读 · 2012年12月31日

不同途径移植HUCB-MSCs治疗脑血管病大鼠microPET-CT评价及其治疗机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

形式化软件规约Radl获取、验证与确认方法研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于Web挖掘的图像和视频标注与搜索关键技术研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于本体的Deep Web搜索技术

国家自然科学基金

2+阅读 · 2009年12月31日

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

Arxiv

0+阅读 · 2023年6月13日

Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard

Arxiv

0+阅读 · 2023年6月13日

HELP ME THINK: A Simple Prompting Strategy for Non-experts to Create Customized Content with Models

Arxiv

0+阅读 · 2023年6月12日

Sticker820K: Empowering Interactive Retrieval with Stickers

Arxiv

0+阅读 · 2023年6月12日

Self-Paced Absolute Learning Progress as a Regularized Approach to Curriculum Learning

Arxiv

0+阅读 · 2023年6月9日

CSKG: The CommonSense Knowledge Graph

CSKG: The CommonSense Knowledge Graph

Arxiv

18+阅读 · 2020年12月21日

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey

Arxiv

20+阅读 · 2020年3月10日

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Arxiv

12+阅读 · 2020年2月19日

Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction

Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction

Arxiv

12+阅读 · 2019年9月23日

Strong Baselines for Simple Question Answering over Knowledge Graphs with and without Neural Networks

Arxiv

17+阅读 · 2018年6月5日

VIP会员

文章信息

相关主题

语言模型化

最新内容

ECCV 2026 | MIMFlow：MIM与归一化流统一图像生成

ECCV 2026 | MIMFlow：MIM与归一化流统一图像生成

专知会员服务

2+阅读 · 今天11:43

超越自回归边界：扩散模型、世界模型与SSM如何重塑代码智能

超越自回归边界：扩散模型、世界模型与SSM如何重塑代码智能

专知会员服务

2+阅读 · 今天11:41

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

专知会员服务

5+阅读 · 今天6:30

网状网络及其在军事领域的运用

网状网络及其在军事领域的运用

专知会员服务

5+阅读 · 今天6:18

《意识即战场——全球安全体系中认知战的演进：乌克兰构建认知作战体系的展望》

《意识即战场——全球安全体系中认知战的演进：乌克兰构建认知作战体系的展望》

专知会员服务

6+阅读 · 今天6:08

无美国参与的欧洲战争方式（万字长文）

无美国参与的欧洲战争方式（万字长文）

专知会员服务

6+阅读 · 今天5:54

重构“下一场战争”的制胜理论：超越兰彻斯特方程与现代系统

重构“下一场战争”的制胜理论：超越兰彻斯特方程与现代系统

专知会员服务

7+阅读 · 今天5:22

《国防工业中基于模型定义的实施：产品定义数字化转型的战略路径》90页

《国防工业中基于模型定义的实施：产品定义数字化转型的战略路径》90页

专知会员服务

7+阅读 · 今天5:15

《国防领域敏感性分析白皮书》

《国防领域敏感性分析白皮书》

专知会员服务

7+阅读 · 今天3:42

综述 | 从问答到任务完成：Agent系统与Harness设计

综述 | 从问答到任务完成：Agent系统与Harness设计

专知会员服务

5+阅读 · 6月24日

Agentic RL：框架、实践与长程智能体训练

Agentic RL：框架、实践与长程智能体训练

专知会员服务

7+阅读 · 6月24日

反无人机拦截器训练与运用课程：对美国陆军部队发展的启示

反无人机拦截器训练与运用课程：对美国陆军部队发展的启示

专知会员服务

10+阅读 · 6月24日

重新思考无人机时代的生存能力

重新思考无人机时代的生存能力

专知会员服务

9+阅读 · 6月24日

装甲突击旅：现代战争思考、战斗与组织

装甲突击旅：现代战争思考、战斗与组织

专知会员服务

7+阅读 · 6月24日

在人工智能加速决策环境中拓展OODA循环

在人工智能加速决策环境中拓展OODA循环

专知会员服务

9+阅读 · 6月24日

相关VIP内容

【超赞的#C++#速查&信息图】“hacking c++ - Cheat Sheets & Infographics”

【超赞的#C++#速查&信息图】“hacking c++ - Cheat Sheets & Infographics”

专知会员服务

30+阅读 · 2022年3月8日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

【下载】Python自然语言处理实战书籍和代码《Natural Language Processing in Action》

【下载】Python自然语言处理实战书籍和代码《Natural Language Processing in Action》

专知会员服务

80+阅读 · 2019年10月27日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

80+阅读 · 2019年10月10日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

超越自回归边界：扩散模型、世界模型与SSM如何重塑代码智能

网状网络及其在军事领域的运用

ECCV 2026 | MIMFlow：MIM与归一化流统一图像生成

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

相关资讯

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

深度自进化聚类：Deep Self-Evolution Clustering

深度自进化聚类：Deep Self-Evolution Clustering

我爱读PAMI

15+阅读 · 2019年4月13日

NLP 2018 Highlights：2018自然语言处理技术亮点汇总

NLP 2018 Highlights：2018自然语言处理技术亮点汇总

AINLP

10+阅读 · 2019年2月9日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

专知

13+阅读 · 2018年6月24日

【推荐】自然语言处理（NLP）指南

【推荐】自然语言处理（NLP）指南

机器学习研究会

35+阅读 · 2017年11月17日

【推荐】MXNet深度情感分析实战

【推荐】MXNet深度情感分析实战

机器学习研究会

16+阅读 · 2017年10月4日

相关论文

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

Arxiv

0+阅读 · 2023年6月13日

Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard

Arxiv

0+阅读 · 2023年6月13日

HELP ME THINK: A Simple Prompting Strategy for Non-experts to Create Customized Content with Models

Arxiv

0+阅读 · 2023年6月12日

Sticker820K: Empowering Interactive Retrieval with Stickers

Arxiv

0+阅读 · 2023年6月12日

Self-Paced Absolute Learning Progress as a Regularized Approach to Curriculum Learning

Arxiv

0+阅读 · 2023年6月9日

CSKG: The CommonSense Knowledge Graph

CSKG: The CommonSense Knowledge Graph

Arxiv

18+阅读 · 2020年12月21日

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey

Arxiv

20+阅读 · 2020年3月10日

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Arxiv

12+阅读 · 2020年2月19日

Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction

Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction

Arxiv

12+阅读 · 2019年9月23日

Strong Baselines for Simple Question Answering over Knowledge Graphs with and without Neural Networks

Arxiv

17+阅读 · 2018年6月5日

相关基金

半线性广义Tricomi方程Cauchy问题解的生命跨度估计研究

国家自然科学基金

0+阅读 · 2017年12月31日

SMAD2调控ERK通路干预M2巨噬细胞活化在糖尿病肾病小鼠肾脏纤维化中的作用及机制

国家自然科学基金

0+阅读 · 2015年12月31日

HEV感染致脑组织损伤的机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

Shh信号通路对CCL2调控在哮喘发生中的作用和机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

燃煤电厂烟气汞及典型重金属排放和脱除机理研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于时频二维训练信息的高谱效多天线TFT-OFDM技术研究

国家自然科学基金

1+阅读 · 2012年12月31日

不同途径移植HUCB-MSCs治疗脑血管病大鼠microPET-CT评价及其治疗机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

形式化软件规约Radl获取、验证与确认方法研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于Web挖掘的图像和视频标注与搜索关键技术研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于本体的Deep Web搜索技术

国家自然科学基金

2+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员