In multi-stage recommender systems, reranking optimizes overall list utility by capturing intra-list contextual dependencies, yet its central challenge is finding optimal sequences within an exponentially large permutation space. Recent studies have shifted towards end-to-end generative frameworks, which typically use list-wise rewards or preference alignment to guide generator training. However, these methods still face two critical issues. The first is heuristic label bias: existing methods often construct training targets from simple rules, such as promoting clicked items to the top, while ignoring causal dependencies within the list context. The second is the credit assignment problem: sparse list-level posterior rewards cannot directly guide the intermediate steps of sequence generation, which leaves the optimization direction ambiguous. To address these issues, we propose DeGRe (Dense-supervised Generative Reranking), a generative reranking framework that bridges the gap between offline exploration and online efficiency through dense supervision. The core of DeGRe is its offline-online decoupled design. During the offline phase, we introduce a Lookahead Evaluator based on cumulative regression, which uses beam search to actively mine high-value lookahead sequences in the unexposed space. During training, we transform the evaluator's step-wise value estimates into dense supervision signals and distill them into a lightweight Online Generator. This mechanism lets the generator internalize lookahead planning, so that online inference requires only a single efficient greedy decoding pass to approximate the global optimum. Experiments demonstrate that DeGRe outperforms baseline models on public benchmarks and industrial datasets. We have successfully deployed DeGRe on Taobao Flash Shopping, where it significantly improved online recommendation performance.
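The offline-online split described above can be illustrated with a toy sketch. It is not the authors' implementation and makes several assumptions. The item values, the context bonus, and the names `toy_evaluator`, `beam_search`, `dense_targets`, `greedy_decode` and `distilled_policy` are all hypothetical. The evaluator is a hand-written function rather than a learned cumulative-regression model. The "generator" is a stand-in that reads the dense targets directly, as if distillation were perfect, so no training takes place.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Prefix = Tuple[int, ...]

# Hypothetical per-item values; not taken from the paper.
ITEM_VALUE = {0: 0.2, 1: 0.5, 2: 0.4, 3: 0.1}
ITEMS = tuple(ITEM_VALUE)


def toy_evaluator(prefix: Prefix) -> float:
    """Stand-in for a lookahead evaluator: cumulative value of a prefix."""
    total = 0.0
    for pos, item in enumerate(prefix):
        gain = ITEM_VALUE[item]
        # Assumed context effect: item 3 is worth more right after item 0.
        if pos > 0 and prefix[pos - 1] == 0 and item == 3:
            gain += 0.6
        total += gain / (pos + 1)  # simple position discount
    return total


def beam_search(start: Prefix, items: Sequence[int],
                evaluator: Callable[[Prefix], float],
                beam_width: int, length: int) -> List[Prefix]:
    """Extend `start` to `length` items, keeping the best prefixes per step."""
    beams = [tuple(start)]
    for _ in range(length - len(start)):
        expanded = [b + (i,) for b in beams for i in items if i not in b]
        expanded.sort(key=evaluator, reverse=True)
        beams = expanded[:beam_width]
    return beams


def dense_targets(prefix: Prefix, items: Sequence[int],
                  evaluator: Callable[[Prefix], float],
                  beam_width: int, length: int,
                  temperature: float = 0.05) -> Dict[int, float]:
    """Step-wise soft labels: softmax over each next item's lookahead value."""
    candidates = [i for i in items if i not in prefix]
    values = [
        evaluator(beam_search(prefix + (i,), items, evaluator,
                              beam_width, length)[0])
        for i in candidates
    ]
    top = max(values)
    weights = [math.exp((v - top) / temperature) for v in values]
    norm = sum(weights)
    return {i: w / norm for i, w in zip(candidates, weights)}


def greedy_decode(policy: Callable[[Prefix], Dict[int, float]],
                  length: int) -> Prefix:
    """Single greedy pass: pick the most probable next item at every step."""
    prefix: Prefix = ()
    for _ in range(length):
        probs = policy(prefix)
        prefix += (max(probs, key=probs.get),)
    return prefix


def distilled_policy(prefix: Prefix) -> Dict[int, float]:
    # Stand-in for a generator that has perfectly absorbed the dense targets.
    return dense_targets(prefix, ITEMS, toy_evaluator, beam_width=3, length=3)


myopic = beam_search((), ITEMS, toy_evaluator, beam_width=1, length=3)[0]
planned = beam_search((), ITEMS, toy_evaluator, beam_width=3, length=3)[0]
decoded = greedy_decode(distilled_policy, length=3)
print(myopic, planned, decoded)
```

In this toy setting, a beam of width 1 follows immediate value and misses the sequence that exploits the context bonus, while the wider beam finds it. The greedy pass over the dense targets recovers that same sequence without searching at decode time, which is the intuition behind distilling lookahead values into a lightweight generator.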