OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.
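To give a rough sense of scale, the headline figures in the abstract can be turned into per-document averages. The short Python sketch below uses only the three totals stated above; the function and variable names are illustrative and do not come from the released code, and the averages are derived from rounded totals, so they are approximate.

```python
# Totals as stated in the abstract (rounded figures).
PAGES = 141_000_000          # web pages (documents)
IMAGES = 353_000_000         # associated images
TOKENS = 115_000_000_000     # text tokens


def per_document_averages(pages: int, images: int, tokens: int) -> tuple[float, float]:
    """Return (average images per document, average text tokens per document)."""
    return images / pages, tokens / pages


imgs_per_doc, toks_per_doc = per_document_averages(PAGES, IMAGES, TOKENS)
print(f"{imgs_per_doc:.2f} images per document on average")
print(f"{toks_per_doc:.0f} text tokens per document on average")
```

On these numbers, a typical document carries roughly two to three images and on the order of 800 text tokens. These are means only and say nothing about how images and text are distributed across documents.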
