Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors that can be indexed and retrieved efficiently with an inverted index. We explore the application of LSR to the multimodal domain, with a focus on text-image retrieval. While LSR has seen success in text retrieval, its application to multimodal retrieval remains underexplored. Current approaches such as LexLIP and STAIR require complex multi-step training on massive datasets. Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors. We address the problems of high dimension co-activation and semantic deviation with a new training algorithm that uses Bernoulli random variables to control query expansion. Experiments with two dense models (BLIP, ALBEF) and two datasets (MSCOCO, Flickr30k) show that the proposed algorithm effectively reduces co-activation and semantic deviation. Our best-performing sparsified model outperforms state-of-the-art text-image LSR models while requiring less training time and less GPU memory. Our approach offers an effective solution for training LSR models in multimodal settings. Our code and model checkpoints are available at github.com/thongnt99/lsr-multimodal
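To make the inverted-index retrieval mentioned in the abstract concrete, here is a minimal sketch of how sparse lexical vectors could be indexed and scored with a dot product. It is an illustration only: the function names (`build_inverted_index`, `search`), the toy vocabulary terms, and the term weights are hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs.

    doc_vectors: dict of doc_id -> {term: weight}, i.e. sparse lexical vectors.
    Only non-zero weights are stored, which is what keeps the index small.
    """
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index


def search(index, query_vector, top_k=10):
    """Score documents by the dot product over terms shared with the query."""
    scores = defaultdict(float)
    for term, query_weight in query_vector.items():
        # Only posting lists of terms present in the query are visited.
        for doc_id, doc_weight in index.get(term, ()):
            scores[doc_id] += query_weight * doc_weight
    # Sort by descending score, breaking ties by doc_id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical sparse vectors for two images (assumed, for illustration).
image_vectors = {
    "img_dog": {"dog": 1.5, "grass": 0.5, "ball": 1.0},
    "img_cat": {"cat": 2.0, "sofa": 0.5},
}
index = build_inverted_index(image_vectors)

# Hypothetical sparse vector for a caption query; "park" matches no image.
results = search(index, {"dog": 1.0, "ball": 0.5, "park": 0.25})
print(results)
```

In this toy setup only `img_dog` shares terms with the query, so it is the only document scored; documents with no overlapping terms are never touched, which is the efficiency argument for sparse lexical representations.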