UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates

Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal reranking remains challenging, particularly when the candidate set contains hybrid text and image items. A key difficulty is the modality gap: a text reranker is intrinsically closer to text candidates than to image candidates, leading to biased and suboptimal cross-modal ranking. Vision-language models (VLMs) mitigate this gap through strong cross-modal alignment and have recently been adopted to build multimodal rerankers. However, most VLM-based rerankers encode all candidates as images, and treating text as images introduces substantial computational overhead. Meanwhile, existing open-source multimodal rerankers are typically trained on general-domain data and often underperform in domain-specific scenarios. To address these limitations, we propose UniRank, a VLM-based reranking framework that natively scores and orders hybrid text-image candidates without any modality conversion. Building on this hybrid scoring interface, UniRank provides an end-to-end domain adaptation pipeline that includes: (1) an instruction-tuning stage that learns calibrated cross-modal relevance scoring by mapping label-token likelihoods to a unified scalar score; and (2) a hard-negative-driven preference alignment stage that constructs in-domain pairwise preferences and performs query-level policy optimization through reinforcement learning from human feedback (RLHF). Extensive experiments on scientific literature retrieval and design patent search demonstrate that UniRank consistently outperforms state-of-the-art baselines, improving Recall@1 by 8.9% and 7.3%, respectively.
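
As an illustration of the scoring interface described in the abstract, the sketch below maps label-token likelihoods to one scalar score and orders a hybrid candidate list with it. The abstract does not say which label tokens are used or how they are normalized, so the "yes"/"no" pair, the two-way softmax, and the names `relevance_score`, `rerank`, and `hard_negative_pairs` are all assumptions made for this example. The logits are plain numbers standing in for VLM outputs, and the hard-negative rule (the top-scoring non-positive candidates) is one common heuristic, not necessarily the one UniRank uses.

```python
import math
from typing import List, Tuple

# Hypothetical candidate record: (candidate_id, logit_yes, logit_no).
# In a real system the two logits would come from a VLM forward pass over
# the query plus a text or image candidate; here they are given directly.
Candidate = Tuple[str, float, float]


def relevance_score(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the assumed label tokens, returning P("yes")."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def rerank(candidates: List[Candidate]) -> List[Tuple[str, float]]:
    """Score every candidate on the same scale and sort by descending score."""
    scored = [(cid, relevance_score(y, n)) for cid, y, n in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def hard_negative_pairs(
    ranked: List[Tuple[str, float]], positive_id: str, k: int = 2
) -> List[Tuple[str, str]]:
    """Pair the known positive with the k highest-scoring other candidates.

    Each pair is (preferred, rejected), the usual shape of pairwise
    preference data.
    """
    negatives = [cid for cid, _ in ranked if cid != positive_id][:k]
    return [(positive_id, neg) for neg in negatives]


if __name__ == "__main__":
    pool = [
        ("text_1", 2.0, 0.0),
        ("image_1", 3.0, 0.0),
        ("text_2", -1.0, 1.0),
    ]
    ranked = rerank(pool)
    print(ranked)
    print(hard_negative_pairs(ranked, positive_id="text_1", k=1))
```

Because text and image candidates receive a score on the same [0, 1] scale, they can be sorted together in a single list, which is the property the abstract refers to as a unified scalar score.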

