Offline Reinforcement Learning (ORL) holds immense promise for safety-critical domains such as industrial robotics, where real-time environmental interaction is often prohibitive. A primary obstacle in ORL remains the distributional shift between the static dataset and the learned policy, which typically mandates a high degree of conservatism that can limit policy improvement. We present MoReBRAC, a model-based framework that addresses this limitation through uncertainty-aware latent synthesis. Rather than relying solely on the fixed dataset, MoReBRAC uses a dual-recurrent world model to synthesize high-fidelity transitions that augment the training manifold. To ensure the reliability of this synthetic data, we implement a hierarchical uncertainty pipeline integrating variational autoencoder (VAE) manifold detection, model sensitivity analysis, and Monte Carlo (MC) dropout. This multi-layered filtering process ensures that only transitions residing in high-confidence regions of the learned dynamics are used for training. Our results on the D4RL Gym-MuJoCo benchmarks show significant performance gains, particularly in the ``random'' and ``suboptimal'' data regimes. We further provide insights into the role of the VAE as a geometric anchor and discuss the distributional trade-offs encountered when learning from near-optimal datasets.
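The hierarchical filtering described above can be sketched schematically. The snippet below is a minimal illustration, not the authors' implementation: the function names, the three-score interface (VAE reconstruction error, sensitivity, MC-dropout variance), and the threshold values are all hypothetical, and the dropout-based dynamics model is a stand-in for the paper's dual-recurrent world model.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_variance(predict, state, action, n_passes=20, p_drop=0.1):
    """Epistemic-uncertainty proxy via MC dropout: run the (hypothetical)
    dynamics model n_passes times with fresh random dropout masks and
    measure the spread of the predicted next states."""
    preds = np.stack([
        predict(state, action, rng.random(state.shape) > p_drop)
        for _ in range(n_passes)
    ])
    return preds.var(axis=0).mean()

def filter_transitions(transitions, recon_err, sensitivity, mc_var,
                       tau_vae=1.0, tau_sens=0.5, tau_mc=0.05):
    """Hierarchical filter: keep a synthetic transition only if it passes
    all three checks -- VAE manifold (reconstruction error), model
    sensitivity, and MC-dropout variance. Thresholds are illustrative."""
    keep = (recon_err < tau_vae) & (sensitivity < tau_sens) & (mc_var < tau_mc)
    return [t for t, k in zip(transitions, keep) if k]
```

In practice the three scores would come from the trained VAE, a perturbation-based sensitivity probe, and stochastic forward passes of the world model; the conjunction of thresholds is what makes the pipeline conservative, since a transition flagged by any single detector is discarded.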