Diffusion policies are competitive for offline reinforcement learning (RL) but are typically guided at sampling time by heuristics that lack a statistical notion of risk. We introduce LRT-Diffusion, a risk-aware sampling rule that treats each denoising step as a sequential hypothesis test between the unconditional prior and the state-conditional policy head. Concretely, we accumulate a log-likelihood ratio across steps and gate the conditional mean with a logistic controller whose threshold tau is calibrated once under H0 to meet a user-specified Type-I level alpha. This turns guidance from a fixed push into an evidence-driven adjustment with a user-interpretable risk budget. Importantly, training is deliberately left vanilla: two heads trained with standard epsilon-prediction under the DDPM framework. LRT guidance composes naturally with Q-gradients: critic-gradient updates can be taken at the unconditional mean, at the LRT-gated mean, or at a blend of the two, exposing a continuum from exploitation to conservatism. We standardize states and actions consistently at train and test time and report a state-conditional out-of-distribution (OOD) metric alongside return. On D4RL MuJoCo tasks, LRT-Diffusion improves the return-OOD trade-off over strong Q-guided baselines in our implementation while honoring the desired alpha. Theoretically, we establish level-alpha calibration, concise stability bounds, and a return comparison showing when LRT surpasses Q-guidance, especially when off-support errors dominate. Overall, LRT-Diffusion is a drop-in, inference-time method that adds principled, calibrated risk control to diffusion policies for offline RL.
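To make the sampling rule concrete, the following is a minimal NumPy sketch of one LRT-gated reverse step and the one-time calibration of tau under H0. The helper names (ddpm_mean, lrt_gated_step, calibrate_tau), the logistic slope kappa, and the simplified isotropic-Gaussian form of the per-step log-likelihood ratio are illustrative assumptions rather than the full implementation.

```python
# Minimal sketch of LRT-gated sampling, assuming two trained
# epsilon-prediction heads eps_uncond(x_t, t) and eps_cond(x_t, s, t).
# kappa and the isotropic-Gaussian LLR are illustrative choices.
import numpy as np

def ddpm_mean(x_t, eps, a_t, abar_t):
    # Reverse-step mean implied by an eps-prediction under DDPM.
    return (x_t - (1.0 - a_t) / np.sqrt(1.0 - abar_t) * eps) / np.sqrt(a_t)

def lrt_gated_step(x_t, s, t, llr, eps_uncond, eps_cond,
                   a_t, abar_t, sigma_t, tau, kappa=8.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    mu_u = ddpm_mean(x_t, eps_uncond(x_t, t), a_t, abar_t)   # H0 mean
    mu_c = ddpm_mean(x_t, eps_cond(x_t, s, t), a_t, abar_t)  # H1 mean
    # Logistic controller: gate the conditional pull by the evidence
    # accumulated so far relative to the calibrated threshold tau.
    gate = 1.0 / (1.0 + np.exp(-kappa * (llr - tau)))
    mean = mu_u + gate * (mu_c - mu_u)
    x_prev = mean + sigma_t * rng.standard_normal(x_t.shape)
    # Sequential LLR update for the new sample under the two Gaussian
    # transition models N(mu_c, sigma^2 I) vs. N(mu_u, sigma^2 I).
    llr += (np.sum((x_prev - mu_u) ** 2)
            - np.sum((x_prev - mu_c) ** 2)) / (2.0 * sigma_t ** 2)
    return x_prev, llr

def calibrate_tau(h0_final_llrs, alpha=0.05):
    # One-time calibration under H0: roll out the sampler with the
    # unconditional head only, collect final LLRs, and set tau to their
    # (1 - alpha) quantile so the gate opens spuriously with
    # probability at most alpha (level-alpha Type-I control).
    return float(np.quantile(h0_final_llrs, 1.0 - alpha))
```

In the same spirit, the critic-gradient (Q-guidance) update described above can be applied at mu_u, at the gated mean, or at a convex blend of the two before noise is added, which is what exposes the exploitation-to-conservatism continuum.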