In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) is attractive only if policies achieve high returns without catastrophic lower-tail risk. Prior work on risk-averse offline RL achieves safety at the cost of either (i) value- or model-based pessimism or (ii) restricted policy classes that limit expressiveness, whereas expressive generative policies based on diffusion or flow models have largely been used in risk-neutral settings. We introduce \textbf{Risk-Aware Multimodal Actor-Critic (RAMAC)}, a simple, modular, model-free framework that couples an expressive generative actor (e.g., diffusion/flow) with a distributional critic and optimizes a composite objective combining Conditional Value-at-Risk (CVaR) with behavioral cloning (BC), enabling risk-sensitive learning in complex multimodal scenarios. Since out-of-distribution (OOD) actions are a major driver of catastrophic failures in offline RL, we further provide an objective-level analysis showing that controlling behavior divergence via BC suppresses OOD actions and stabilizes CVaR. Instantiating RAMAC with a diffusion actor, we illustrate these insights on a 2-D risky bandit and evaluate on Stochastic-D4RL, observing consistent gains in $\mathrm{CVaR}_{0.1}$ while maintaining strong returns. The code and experimental results are available on the \href{https://kaifukazawa.github.io/ramac-project/}{project website}.
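As a rough illustration of the quantities named above, the following sketch computes an empirical lower-tail CVaR from sampled returns and combines it with a BC term. It is a minimal sketch under stated assumptions, not the RAMAC implementation: the function names `empirical_cvar` and `composite_loss`, the weight `eta`, and the additive form of the loss are all hypothetical, since the abstract does not specify the exact objective.

```python
import numpy as np


def empirical_cvar(returns, alpha=0.1):
    """Lower-tail CVaR: mean of the worst alpha-fraction of sampled returns."""
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    # Keep at least one sample so the tail mean is always defined.
    k = max(1, int(np.ceil(alpha * sorted_returns.size)))
    return float(sorted_returns[:k].mean())


def composite_loss(critic_samples, bc_loss, alpha=0.1, eta=1.0):
    """Hypothetical actor loss (an assumption, not the paper's formula).

    Minimizing it trades off staying close to the data (bc_loss)
    against raising the CVaR of the critic's return samples.
    """
    return bc_loss - eta * empirical_cvar(critic_samples, alpha)


if __name__ == "__main__":
    # Illustrative return samples, e.g. quantile estimates from a distributional critic.
    samples = [4.0, 1.0, 3.0, 2.0, 8.0, 7.0, 6.0, 5.0]
    print("CVaR_0.25:", empirical_cvar(samples, alpha=0.25))
    print("loss:", composite_loss(samples, bc_loss=0.5, alpha=0.25, eta=2.0))
```

With a larger `eta` the sketch weights tail performance more heavily, while a larger BC term keeps actions near the dataset; how RAMAC actually balances the two is described in the paper itself, not here.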