Recently, LLM-based recommender systems have attracted increasing attention, but their industrial deployment is still exploratory. Most deployments use LLMs as feature enhancers, generating augmentation knowledge in the offline stage. However, in recommendation scenarios involving numerous users and items, even offline generation with LLMs consumes considerable time and resources. This inefficiency stems from the autoregressive nature of LLMs, and a promising direction for acceleration is speculative decoding, a Draft-then-Verify paradigm that increases the number of tokens generated per decoding step. In this paper, we first identify that recommendation knowledge generation is well suited to retrieval-based speculative decoding. We then discern two characteristics of this setting: (1) the extensive items and users in recommender systems (RSs) make retrieval inefficient, and (2) RSs exhibit high tolerance for diversity in the text generated by LLMs. Based on these insights, we propose a Decoding Acceleration Framework for LLM-based Recommendation (dubbed DARE), which introduces a Customized Retrieval Pool to improve retrieval efficiency and Relaxed Verification to increase the acceptance rate of draft tokens. Extensive experiments demonstrate that DARE achieves a 3-5x speedup and is compatible with various frameworks and backbone LLMs. DARE has also been deployed in online advertising scenarios within a large-scale commercial environment, achieving a 3.45x speedup while maintaining downstream performance.
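The Draft-then-Verify loop described above can be sketched in miniature. This is an illustrative assumption, not the paper's implementation: the corpus, the `toy_topk` stand-in for an LLM's top-k candidates, and all function names are hypothetical. It shows the two ideas the abstract names: a retrieval pool mapping n-gram prefixes to draft continuations, and relaxed verification that accepts a draft token if it appears anywhere in the model's top-k rather than only when it is the argmax.

```python
# Minimal sketch of retrieval-based speculative decoding with relaxed
# (top-k) verification, in the spirit of DARE. The toy model, corpus,
# and function names below are illustrative assumptions.
from collections import defaultdict


def build_retrieval_pool(corpus, n=2):
    """Map each n-gram prefix to draft continuations (up to 3 tokens) seen in the corpus."""
    pool = defaultdict(list)
    for doc in corpus:
        for i in range(len(doc) - n):
            pool[tuple(doc[i:i + n])].append(doc[i + n:i + n + 3])
    return pool


def toy_topk(prefix):
    """Stand-in for an LLM's top-k next-token candidates (hypothetical)."""
    table = {
        ("users", "like"): ["casual", "sports", "retro"],
        ("like", "casual"): ["shoes", "wear"],
        ("casual", "shoes"): ["and", "with"],
    }
    return table.get(tuple(prefix[-2:]), ["<eos>"])


def generate(prompt, pool, max_len=8, n=2):
    out = list(prompt)
    while len(out) < max_len:
        # Draft: retrieve a continuation for the current n-gram suffix.
        draft = pool.get(tuple(out[-n:]), [[]])[0]
        accepted = 0
        for tok in draft:
            # Relaxed verification: accept the draft token if it appears
            # anywhere in the model's top-k, not only if it is the argmax.
            if tok in toy_topk(out):
                out.append(tok)
                accepted += 1
            else:
                break
        if accepted == 0:
            nxt = toy_topk(out)[0]  # fall back to one standard decoding step
            if nxt == "<eos>":
                break
            out.append(nxt)
    return out
```

In one pass, a single retrieval can supply several tokens that the model verifies in parallel, which is where the speedup over token-by-token autoregressive decoding comes from; the relaxed top-k check trades exact distribution matching for a higher acceptance rate, which RSs tolerate because of their high diversity tolerance.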