Large language models (LLMs) excel at capturing semantic nuances and therefore show impressive relevance-ranking performance in modern recommendation and search systems. However, they suffer from high computational overhead under industrial latency and throughput requirements. In particular, cross-encoder ranking systems often create long-context, prefill-heavy workloads, as the model must be presented with the user, query, and item information. To this end, we propose MixLM, a novel LLM-based ranking framework that significantly improves system throughput by reducing the input context length while preserving the semantic strength of cross-encoder rankers. In contrast to a standard ranking system, where the context is presented to the model as pure text, we propose mix-interaction, a mixture of text and embedding tokens used to represent the input. Specifically, MixLM encodes each item in the catalog into a few embedding tokens and stores them in a nearline cache. The encoded item descriptions are used during online inference, effectively reducing the item length from a few thousand text tokens to a few embedding tokens. We share insights from deploying our MixLM framework in a real-world search application at LinkedIn, including a detailed discussion of our training pipelines and a thorough analysis of our online serving infrastructure optimization. With the same latency budget and on-par relevance metrics, MixLM increased throughput by 10.0x compared with strong baselines and by 75.9x over full-text LLM rerankers. The efficiency gains delivered by MixLM enabled full-traffic deployment of LLM-powered search, which resulted in a significant 0.47\% increase in Daily Active Users (DAU) in online A/B tests.
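To make the mix-interaction idea concrete, the following is a minimal sketch of how a ranker's input sequence could be assembled from text tokens plus cached item embedding tokens. All names, dimensions, and data here are hypothetical illustrations (the actual MixLM tokenizer, item encoder, and cache are not shown); the point is only that the item side contributes a handful of precomputed embedding vectors instead of thousands of text tokens.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
HIDDEN = 8                 # model hidden size (assumed)
EMB_TOKENS_PER_ITEM = 3    # embedding tokens per catalog item (assumed)

rng = np.random.default_rng(0)

# Stand-in for the nearline cache: item id -> precomputed embedding
# tokens of shape (EMB_TOKENS_PER_ITEM, HIDDEN), produced offline by
# an item encoder (not shown).
item_cache = {
    "item_42": rng.random((EMB_TOKENS_PER_ITEM, HIDDEN)),
}

def embed_text(tokens):
    """Stand-in for the LLM's text-token embedding lookup."""
    return rng.random((len(tokens), HIDDEN))

def build_mix_input(query_tokens, item_id):
    """Concatenate text-token embeddings with cached item embedding
    tokens; the item part is fetched, not re-encoded, at inference."""
    text_part = embed_text(query_tokens)
    item_part = item_cache[item_id]
    return np.concatenate([text_part, item_part], axis=0)

seq = build_mix_input(["find", "ml", "jobs"], "item_42")
print(seq.shape)  # 3 query tokens + 3 item embedding tokens, each HIDDEN-dim
```

In a full-text reranker, the item side of this sequence would instead be the embedded description text, often thousands of tokens long; here it is fixed at `EMB_TOKENS_PER_ITEM` rows regardless of description length, which is the source of the prefill savings.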