In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format enables not only few-shot learning via interleaving independent supervised (image, text) examples, but also more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images and text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4 (mmc4), an augmentation of the popular text-only c4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features, a process that we show outperforms alternatives. mmc4 spans everyday topics such as cooking, travel, and technology. A manual inspection of a random sample of documents shows that a vast majority (90%) of images are topically relevant, and that linear assignment frequently selects individual sentences that are specifically well aligned with each image (78%). After filtering out NSFW images, ads, and similar content, the corpus comprises 103M documents containing 585M images interleaved with 43B English tokens.
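
To make the image placement step concrete, here is a minimal sketch of linear assignment over image-sentence similarities. It is illustrative only, not the released pipeline: the function names (`cosine`, `assign_images_to_sentences`) are hypothetical, the toy 2-d vectors stand in for CLIP features, and it assumes that each sentence receives at most one image and that a document has at least as many sentences as images. The solver is an exhaustive search that is practical only for tiny inputs; at scale, a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the natural substitute.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_images_to_sentences(image_feats, sentence_feats):
    """Return {image_index: sentence_index} maximizing total similarity.

    Assumption (not stated in the abstract): each sentence may receive
    at most one image, so there must be at least as many sentences as images.
    """
    n_img, n_sent = len(image_feats), len(sentence_feats)
    if n_img > n_sent:
        raise ValueError("need at least as many sentences as images")

    # Similarity matrix: rows are images, columns are sentences.
    sim = [[cosine(img, sent) for sent in sentence_feats] for img in image_feats]

    # Exhaustive search over one-to-one placements (exact, but exponential).
    best_total, best_perm = float("-inf"), ()
    for perm in permutations(range(n_sent), n_img):
        total = sum(sim[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return dict(enumerate(best_perm))


# Toy stand-ins for CLIP features: both images are most similar to sentence 0.
images = [[1.0, 0.0], [0.9, 0.1]]
sentences = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
placement = assign_images_to_sentences(images, sentences)
print(placement)  # {0: 0, 1: 2}
```

In this toy case, matching each image independently to its most similar sentence would place both images on sentence 0. The joint assignment instead gives sentence 0 to image 0 and moves image 1 to its next-best match, sentence 2, which is the behavior that motivates treating placement as a single assignment problem.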