Learned Sparse Retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors, which can be indexed and retrieved efficiently with an inverted index. While LSR has shown promise in text retrieval, its potential in multi-modal retrieval remains largely unexplored. Motivated by this gap, we explore the application of LSR in the multi-modal domain, focusing on Multi-Modal Learned Sparse Retrieval (MLSR). We conduct experiments with several MLSR model configurations and evaluate their performance on the image suggestion task. We find that solving the task based solely on image content is challenging. Enriching the image content with its caption improves model performance significantly, which suggests that captions are important for providing fine-grained concepts and contextual information about images. Our approach presents a practical and effective solution for training LSR models in multi-modal settings.
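To make the idea of sparse lexical vectors served from an inverted index more concrete, the following is a minimal sketch in Python. It is not the implementation used in this work: the function names `build_inverted_index` and `search` are hypothetical, and the term weights are hand-written placeholders standing in for the output of a learned encoder. The sketch only illustrates that, once queries and documents (here, images) are represented as term-weight maps, scoring reduces to a dot product over shared terms that can be computed by walking posting lists.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            if weight > 0:  # only non-zero weights are stored
                index[term].append((doc_id, weight))
    return index


def search(index, query_vector, top_k=3):
    """Score documents by the dot product over terms shared with the query."""
    scores = defaultdict(float)
    for term, query_weight in query_vector.items():
        # Only posting lists of query terms are visited.
        for doc_id, doc_weight in index.get(term, []):
            scores[doc_id] += query_weight * doc_weight
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Placeholder sparse vectors (assumed values, not produced by any model).
doc_vectors = {
    "img_1": {"dog": 1.5, "beach": 0.5, "ball": 0.25},
    "img_2": {"cat": 1.0, "sofa": 0.75},
    "img_3": {"dog": 0.5, "park": 1.0},
}
index = build_inverted_index(doc_vectors)

query_vector = {"dog": 1.0, "beach": 0.5}
results = search(index, query_vector)
print(results)
```

In this toy setup, documents that share no term with the query never receive a score, which is what makes inverted-index retrieval efficient for sparse representations.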