In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, error-prone, and lossy. To this end, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that regards document screenshots as a unified input format, requiring no content-extraction preprocessing while preserving all the information in a document (e.g., text, image, and layout). DSE leverages a large vision-language model to directly encode document screenshots into dense representations for retrieval. To evaluate our method, we first craft Wiki-SS, a corpus of 1.3M Wikipedia web-page screenshots, to answer questions from the Natural Questions dataset. In this text-intensive document retrieval setting, DSE shows competitive effectiveness compared with text retrieval methods that rely on parsing; for example, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy. Additionally, in a mixed-modality slide retrieval task, DSE significantly outperforms OCR-based text retrieval by over 15 points in nDCG@10. These experiments show that DSE is an effective document retrieval paradigm for diverse types of documents. Model checkpoints, code, and the Wiki-SS collection will be released.
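The retrieval step described above can be illustrated with a minimal sketch: once a vision-language encoder has mapped each document screenshot and each query to a fixed-size dense vector, retrieval reduces to nearest-neighbor search by cosine similarity. The encoder itself is omitted here (toy vectors stand in for screenshot and query embeddings); only the similarity-based ranking is shown, and the function names are illustrative, not the paper's API.

```python
import numpy as np

def top_k(query_emb, doc_embs, k=1):
    """Return indices of the k most similar documents by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q  # cosine similarity of each document to the query
    return np.argsort(-scores)[:k]

# Toy corpus: three "screenshot embeddings" and one query embedding.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.6, 0.8])
print(top_k(query, docs, k=1))  # index of the nearest screenshot
```

In practice, the document embeddings would be precomputed offline and indexed, so only the query needs encoding at search time.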