GenRec: A Preference-Oriented Generative Framework for Large-Scale Recommendation

Generative Retrieval (GR) offers a promising paradigm for recommendation through next-token prediction (NTP). However, scaling it to large industrial systems introduces three challenges: (i) identical model inputs within a single request may produce inconsistent outputs because of the pagination request mechanism; (ii) encoding long user behavior sequences is prohibitively expensive when each item is represented by multiple tokens based on semantic IDs; and (iii) the generative policy must be aligned with nuanced user preference signals. We present GenRec, a preference-oriented generative framework deployed on the JD App that addresses these challenges within a single decoder-only architecture. For the training objective, we propose a Page-wise NTP task, which supervises over an entire interaction page rather than each interacted item individually, providing a denser gradient signal and resolving the one-to-many ambiguity of point-wise training. On the prefilling side, an asymmetric linear Token Merger compresses multi-token Semantic IDs in the prompt while preserving full-resolution decoding, reducing input length by roughly 2x with negligible accuracy loss. To further align outputs with user satisfaction, we introduce GRPO-SR, a reinforcement learning method that pairs Group Relative Policy Optimization with NLL regularization for training stability and employs Hybrid Rewards, which combine a dense reward model with a relevance gate to mitigate reward hacking. In month-long online A/B tests serving production traffic, GenRec achieves a 9.5% improvement in click count and an 8.7% improvement in transaction count over the existing pipeline.
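To make the reward side of GRPO-SR more concrete, the following is a minimal sketch of a relevance-gated reward combined with the standard group-relative advantage used in Group Relative Policy Optimization (each sampled candidate's reward is normalized by the mean and standard deviation of its group). The abstract does not specify how the relevance gate is applied, so the hard threshold, the function names `hybrid_reward` and `group_relative_advantages`, and all scores below are illustrative assumptions rather than the deployed method. The NLL regularization term is not shown.

```python
from statistics import mean, pstdev


def hybrid_reward(dense_score: float, relevance: float,
                  gate_threshold: float = 0.5) -> float:
    """Hypothetical hybrid reward (assumption, not the paper's exact rule).

    The dense reward-model score is passed through only when the
    relevance score clears the gate; otherwise the reward is zero, so a
    candidate cannot earn reward from the dense model while being
    irrelevant to the request.
    """
    return dense_score if relevance >= gate_threshold else 0.0


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidates sampled for the same request (made-up scores).
dense = [0.9, 0.6, 0.8, 0.2]      # dense reward-model scores
relevance = [0.1, 0.7, 0.9, 0.8]  # relevance scores fed to the gate

rewards = [hybrid_reward(d, r) for d, r in zip(dense, relevance)]
advantages = group_relative_advantages(rewards)

for reward, adv in zip(rewards, advantages):
    print(f"reward={reward:.2f}  advantage={adv:+.3f}")
```

In this toy group, the first candidate has the highest dense score but fails the relevance gate, so it ends up with the lowest reward and a negative advantage. That is the kind of reward-hacking case a gate is meant to suppress.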
