Generative recommendation produces items end-to-end via sequence generation, but training on behavioral logs provides only weak supervision of underlying user intent. Although Large Language Models (LLMs) offer rich semantic priors that could supply such supervision, their direct adoption in industrial recommendation is hindered by two obstacles: semantic signals can conflict with platform business objectives, and LLM inference is prohibitively expensive at scale. This paper presents S-GRec, a semantic-aware framework that decouples a lightweight online generator from an offline LLM-based semantic judge used for train-time supervision. S-GRec introduces a two-stage Personalized Semantic Judge (PSJ) that produces interpretable aspect-level evidence and learns user-conditional aggregation from pairwise feedback, yielding stable semantic rewards. To prevent semantic supervision from deviating from business goals, Asymmetric Advantage Policy Optimization (A2PO) anchors optimization on business rewards (e.g., eCPM) and injects semantic advantages only when they are consistent with the business signal. Extensive experiments on public benchmarks and a large-scale production system validate both effectiveness and scalability, including statistically significant CTR gains and a 1.19\% GMV lift in online A/B tests, without requiring real-time LLM inference.
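The asymmetric gating idea behind A2PO can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: it assumes mean-centered advantages for both reward streams, a hypothetical mixing weight `lam`, and sign agreement as the "consistency" test under which the semantic advantage is injected.

```python
import numpy as np

def asymmetric_advantage(business_rewards, semantic_rewards, lam=0.5):
    """Toy sketch of asymmetric advantage mixing (hypothetical, not the
    paper's exact A2PO): the business advantage always anchors the update,
    while the semantic advantage is added only where its sign agrees with
    the business advantage."""
    a_biz = business_rewards - business_rewards.mean()  # anchor: business advantage
    a_sem = semantic_rewards - semantic_rewards.mean()  # auxiliary: semantic advantage
    consistent = np.sign(a_sem) == np.sign(a_biz)       # inject only when aligned
    return a_biz + lam * np.where(consistent, a_sem, 0.0)

# When the semantic signal disagrees with the business signal, the combined
# advantage falls back to the pure business advantage.
adv = asymmetric_advantage(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Because the semantic term is zeroed out wherever it opposes the business advantage, the policy gradient can never be pushed against the business objective by the semantic reward, which is the anchoring property the abstract describes.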