Scaling recommender systems with large language models (LLMs) has become a prominent industry trend. However, aligning the LLM's semantic space with the recommender's ID space through post-training, such as supervised fine-tuning (SFT) and reinforcement learning (RL), remains challenging. Existing LLM4Rec paradigms are bottlenecked by two issues: (1) chain-of-thought (CoT) quality in open-domain recommendation is difficult to measure and improve during SFT, and (2) the trade-off between LLM semantic rewards and recommendation preference rewards is neglected during RL alignment. To address these challenges, we present Taiji, a novel LLM-as-Enhancer framework designed for industrial recommender systems. To overcome the SFT bottleneck, we use reverse-engineered reasoning and open-ended rejection sampling to generate high-quality, domain-specific CoT data. To resolve the RL alignment issue, we propose Pareto Optimal Policy Optimization (POPO), which adaptively adjusts cross-domain reward weights. In theory, POPO achieves an optimal trade-off between the semantic world knowledge of LLMs and the collaborative ID features that represent online user preferences. Extensive offline evaluations and online A/B tests validate the effectiveness of Taiji. Deployed on Kuaishou's advertising platform since May 2026, Taiji currently serves over 400 million users daily, yielding significant commercial revenue and demonstrating robust scalability in web-scale environments.
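The abstract does not say how POPO computes its reward weights, so the following is only a rough illustration of what adaptive weighting of two reward signals can look like. It is not POPO itself: it swaps in the closed-form two-objective min-norm rule from multiple-gradient descent (MGDA), which picks the convex combination of two gradients with the smallest norm. The names `g_sem` (gradient of a semantic reward) and `g_pref` (gradient of a preference reward) are hypothetical, and the gradients are plain lists of floats for simplicity.

```python
from typing import List, Sequence, Tuple


def min_norm_weights(
    g_sem: Sequence[float], g_pref: Sequence[float]
) -> Tuple[float, float]:
    """Weights (w_sem, w_pref), non-negative and summing to 1, that
    minimise the norm of w_sem * g_sem + w_pref * g_pref.

    This is the standard two-objective MGDA closed form, used here as
    an assumed stand-in for an adaptive reward-weighting rule.
    """
    diff = [a - b for a, b in zip(g_sem, g_pref)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        # Identical gradients: any convex weighting gives the same direction.
        return 0.5, 0.5
    # Unconstrained minimiser: w_sem = ((g_pref - g_sem) . g_pref) / ||g_sem - g_pref||^2
    num = sum((b - a) * b for a, b in zip(g_sem, g_pref))
    # Clip to [0, 1] so the weights stay a convex combination.
    w_sem = min(1.0, max(0.0, num / denom))
    return w_sem, 1.0 - w_sem


def combine(g_sem: Sequence[float], g_pref: Sequence[float]) -> List[float]:
    """Weighted update direction built from the two reward gradients."""
    w_sem, w_pref = min_norm_weights(g_sem, g_pref)
    return [w_sem * a + w_pref * b for a, b in zip(g_sem, g_pref)]


if __name__ == "__main__":
    # Orthogonal gradients of equal size receive equal weight.
    print(min_norm_weights([1.0, 0.0], [0.0, 1.0]))
    # Directly opposed gradients cancel: no direction improves both rewards.
    print(combine([1.0, 0.0], [-1.0, 0.0]))
```

The min-norm direction has a non-negative inner product with both input gradients, so a small step along it does not decrease either reward to first order. When the combined direction is zero, the current point is Pareto-stationary with respect to the two rewards.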