Existing information retrieval (IR) models often assume a homogeneous format, which limits their applicability to diverse user needs such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image. To address these varied information-seeking demands, we introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. UniIR is a single retrieval system jointly trained on ten diverse multimodal-IR datasets; it interprets user instructions to execute various retrieval tasks, demonstrating robust performance across existing datasets and zero-shot generalization to new tasks. Our experiments show that multi-task training and instruction tuning are key to UniIR's generalization ability. Additionally, we construct M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.