Generative recommendation via autoregressive models has unified retrieval and ranking into a single conditional generation framework. However, fine-tuning these models with Reinforcement Learning (RL) often suffers from a fundamental probability-reward mismatch. Conventional likelihood-dominated decoding (e.g., beam search) exhibits a myopic bias toward locally probable prefixes, which causes two critical failures: (1) insufficient exploration, where high-reward items in low-probability branches are prematurely pruned and rarely sampled, and (2) advantage compression, where trajectories sharing high-probability prefixes receive highly correlated rewards with low within-group variance, yielding a weak comparative signal for RL. To address these challenges, we propose V-STAR, a Value-guided Sampling and Tree-structured Advantage Reinforcement framework. V-STAR forms a self-evolving loop via two synergistic components. First, we develop Value-Guided Efficient Decoding (VED), which identifies decisive nodes and selectively deepens high-potential prefixes, improving exploration efficiency without exhaustive tree search. Second, we propose Sibling-GRPO, which exploits the induced tree topology to compute sibling-relative advantages and concentrates learning signals on decisive branching decisions. Extensive experiments on both offline and online datasets demonstrate that V-STAR outperforms state-of-the-art baselines, delivering superior accuracy and candidate-set diversity under strict latency constraints.
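To make the sibling-relative advantage idea concrete, the following is a minimal Python sketch (not the authors' implementation) of how rewards could be normalized within sibling groups induced by the decoding tree rather than across the whole sampled batch; the data layout and the names `parent_id`, `reward`, and `sibling_relative_advantages` are illustrative assumptions.

```python
# Minimal sketch of sibling-relative advantage computation (illustrative only).
# Assumption: each sampled trajectory records the branching (parent) node it
# descends from in the decoding tree; names and data layout are hypothetical.
from collections import defaultdict
from statistics import mean, pstdev

def sibling_relative_advantages(trajectories, eps=1e-6):
    """trajectories: list of dicts with keys 'parent_id' and 'reward'.
    Returns one advantage per trajectory, normalizing each reward against its
    siblings (same parent node) instead of the whole batch, so the learning
    signal concentrates on decisive branching decisions."""
    groups = defaultdict(list)
    for i, t in enumerate(trajectories):
        groups[t['parent_id']].append(i)

    advantages = [0.0] * len(trajectories)
    for idxs in groups.values():
        rewards = [trajectories[i]['reward'] for i in idxs]
        mu = mean(rewards)
        sigma = pstdev(rewards) if len(rewards) > 1 else 0.0
        for i in idxs:
            advantages[i] = (trajectories[i]['reward'] - mu) / (sigma + eps)
    return advantages

# Example: two sibling groups; only within-group reward differences produce
# a non-zero advantage, while identical siblings contribute no signal.
trajs = [
    {'parent_id': 'node_A', 'reward': 1.0},
    {'parent_id': 'node_A', 'reward': 0.2},
    {'parent_id': 'node_B', 'reward': 0.5},
    {'parent_id': 'node_B', 'reward': 0.5},
]
print(sibling_relative_advantages(trajs))
```

Under these assumptions, siblings with identical rewards yield zero advantage, which mirrors the advantage-compression failure the abstract attributes to likelihood-dominated decoding, while divergent siblings produce the strong comparative signal that Sibling-GRPO is designed to exploit.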