Large language models (LLMs) have recently shown strong potential for ranking: they capture semantic relevance and adapt across diverse domains. Existing LLM-based methods, however, remain constrained by limited context length and high computational cost, which restricts their applicability to real-world scenarios where candidate pools often scale to millions. To address this challenge, we propose LRanker, a framework tailored for large-candidate ranking. LRanker incorporates two components: a candidate aggregation encoder that uses K-means clustering to explicitly model global candidate information, and a graph-based test-time scaling mechanism that partitions candidates into subsets, generates multiple query embeddings, and integrates them through an ensemble procedure. By aggregating diverse embeddings instead of relying on a single representation, this mechanism enhances robustness and expressiveness, leading to more accurate ranking over massive candidate pools. We evaluate LRanker on seven tasks across three RBench scenarios that differ in candidate scale. Experimental results show that LRanker achieves gains of over 30% in the RBench-Small scenario, improves MRR by 3-9% in the RBench-Large scenario, and remains scalable in the RBench-Ultra scenario, with 20-30% improvements on a pool of more than 6.8M candidates. Ablation studies further verify the effectiveness of its key components. Together, these findings demonstrate the robustness, scalability, and effectiveness of LRanker for massive-candidate ranking.
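The pipeline described in the abstract (cluster candidates with K-means, partition the pool into subsets, produce one query embedding per subset, then ensemble) can be illustrated with a small NumPy sketch. This is not the authors' implementation, and several parts are assumptions made only for illustration: the function names are hypothetical, `subset_query_embedding` is a softmax-weighted stand-in for the LLM-based encoder, the partition is random rather than graph-based, and the ensemble is a plain average. The `mean_reciprocal_rank` helper follows the standard definition of MRR.

```python
import numpy as np


def kmeans_centroids(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means on the rows of x; returns a (k, d) centroid array."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid: (n, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids


def subset_query_embedding(query, centroids):
    """Hypothetical stand-in for the encoder: mix the query with a
    softmax-weighted summary of one subset's centroids, then L2-normalize."""
    scores = centroids @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    emb = query + weights @ centroids
    return emb / np.linalg.norm(emb)


def rank_candidates(query, candidates, n_subsets=4, k=3, seed=0):
    """Partition candidates, build one query embedding per subset,
    average them, and rank all candidates by dot-product score."""
    rng = np.random.default_rng(seed)
    # Assumption: a random partition replaces the paper's graph-based one.
    subsets = np.array_split(rng.permutation(len(candidates)), n_subsets)
    embs = [
        subset_query_embedding(query, kmeans_centroids(candidates[idx], k, seed=seed))
        for idx in subsets
    ]
    ensemble = np.mean(embs, axis=0)  # assumption: simple averaging as the ensemble
    return np.argsort(-(candidates @ ensemble))  # best candidate first


def mean_reciprocal_rank(rankings, relevant):
    """MRR: mean of 1 / (1-based rank of the relevant item) over all queries."""
    ranks = [int(np.where(r == t)[0][0]) + 1 for r, t in zip(rankings, relevant)]
    return float(np.mean([1.0 / r for r in ranks]))


# Usage on synthetic data: the query is a noisy copy of candidate 17.
rng = np.random.default_rng(42)
candidates = rng.normal(size=(200, 16))
query = candidates[17] + 0.1 * rng.normal(size=16)
ranking = rank_candidates(query, candidates)
print("rank of target:", int(np.where(ranking == 17)[0][0]) + 1)
print("MRR:", mean_reciprocal_rank([ranking], [17]))
```

The point of the sketch is the data flow: each subset contributes its own view of the query, and the final score uses the aggregate rather than any single embedding. Each subset must contain at least `k` candidates for the centroid initialization to succeed.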