Large language models (LLMs) excel at capturing semantic nuances and therefore show impressive relevance-ranking performance in modern recommendation and search systems. However, they suffer from high computational overhead under industrial latency and throughput requirements. In particular, cross-encoder ranking systems often create long-context, prefill-heavy workloads, as the model must be presented with the user, query, and item information. To this end, we propose MixLM, a novel LLM-based ranking framework that significantly improves system throughput by reducing the input context length while preserving the semantic strength of cross-encoder rankers. In contrast to a standard ranking system, where the context is presented to the model as pure text, we propose mix-interaction, a mixture of text and embedding tokens used to represent the input. Specifically, MixLM encodes each item in the catalog into a few embedding tokens and stores them in a nearline cache. The encoded item descriptions are used during online inference, effectively reducing the item length from a few thousand text tokens to a few embedding tokens. We share insights from deploying our MixLM framework in a real-world search application at LinkedIn, including a detailed discussion of our training pipelines and a thorough analysis of our online serving infrastructure optimization. With the same latency budget and on-par relevance metrics, MixLM increased throughput by 10.0x compared with strong baselines and by 75.9x over full-text LLM rerankers. The efficiency gains delivered by MixLM enabled full-traffic deployment of LLM-powered search, which resulted in a significant 0.47\% increase in Daily Active Users (DAU) in online A/B tests.
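To make the mix-interaction idea concrete, the following is a minimal sketch of how a ranker's input sequence could be assembled from text tokens plus cached item embedding tokens. All names, dimensions, and data here are hypothetical illustrations (the actual MixLM tokenizer, item encoder, and cache are not shown); the point is only that the item side contributes a handful of precomputed embedding vectors instead of thousands of text tokens.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
HIDDEN = 8                 # model hidden size (assumed)
EMB_TOKENS_PER_ITEM = 3    # embedding tokens per catalog item (assumed)

rng = np.random.default_rng(0)

# Stand-in for the nearline cache: item id -> precomputed embedding
# tokens of shape (EMB_TOKENS_PER_ITEM, HIDDEN), produced offline by
# an item encoder (not shown).
item_cache = {
    "item_42": rng.random((EMB_TOKENS_PER_ITEM, HIDDEN)),
}

def embed_text(tokens):
    """Stand-in for the LLM's text-token embedding lookup."""
    return rng.random((len(tokens), HIDDEN))

def build_mix_input(query_tokens, item_id):
    """Concatenate text-token embeddings with cached item embedding
    tokens; the item part is fetched, not re-encoded, at inference."""
    text_part = embed_text(query_tokens)
    item_part = item_cache[item_id]
    return np.concatenate([text_part, item_part], axis=0)

seq = build_mix_input(["find", "ml", "jobs"], "item_42")
print(seq.shape)  # 3 query tokens + 3 item embedding tokens, each HIDDEN-dim
```

In a full-text reranker, the item side of this sequence would instead be the embedded description text, often thousands of tokens long; here it is fixed at `EMB_TOKENS_PER_ITEM` rows regardless of description length, which is the source of the prefill savings.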