Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

Pretrained imitation policies have become a strong foundation for robot manipulation, but they often require online improvement to overcome execution errors, limited dataset coverage, and deployment mismatch. A central question is therefore how reinforcement learning (RL) should adapt policies after offline pretraining. Existing lightweight methods commonly apply residual corrections directly in action space, but this often leads to noisy and poorly structured exploration. In this work, we propose Z-Perturbation Reinforcement Learning (ZPRL), an approach that steers pretrained policies through a compact bottleneck latent rather than through policy weights or output actions. During offline training, we augment the policy with a plug-and-play variational information bottleneck (VIB) module to extract a task-relevant latent interface from observation embeddings. During online finetuning, the base policy is frozen and RL learns only a residual perturbation on this latent, whose decoded representation conditions the frozen action generator. We instantiate ZPRL on flow-matching policies and evaluate it on eight simulation tasks and four real-world tasks. Across diverse manipulation settings, ZPRL improves both sample efficiency and final performance over strong post-training baselines. In the real world, ZPRL improves the average success rate on four tasks by 33.7% over imitation base policies while producing smoother exploration behaviors than an action residual counterpart. These results suggest that a compact, task-aligned bottleneck latent provides an effective interface for online RL adaptation. More videos can be found at https://manutdmoon.github.io/ZPRL/.
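The latent-steering interface described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the VIB head, how the perturbation is bounded, or any function names. The sketch assumes a linear head that outputs a diagonal Gaussian, a residual clipped to [-1, 1] and scaled by a hypothetical `scale` parameter, and a placeholder in place of the frozen flow-matching action generator.

```python
import math
import random


def vib_encode(embedding, w_mu, w_logvar):
    """Hypothetical linear VIB head: observation embedding -> (mu, log_var)."""
    mu = [sum(w * e for w, e in zip(row, embedding)) for row in w_mu]
    log_var = [sum(w * e for w, e in zip(row, embedding)) for row in w_logvar]
    return mu, log_var


def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), a common VIB regularizer."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def steer_latent(z, delta, scale):
    """Add a bounded residual to the latent (assumed form of the perturbation).

    `delta` stands in for the output of the online RL policy. It is clipped
    to [-1, 1], so each latent coordinate moves by at most `scale`.
    """
    return [zi + scale * max(-1.0, min(1.0, di)) for zi, di in zip(z, delta)]


def frozen_action_generator(z):
    """Placeholder for the frozen decoder and action generator (assumption)."""
    return [math.tanh(zi) for zi in z]


if __name__ == "__main__":
    rng = random.Random(0)
    embedding = [0.5, -1.0, 0.25]
    # Toy weights for a 2-dimensional bottleneck; values are arbitrary.
    w_mu = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]]
    w_logvar = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

    mu, log_var = vib_encode(embedding, w_mu, w_logvar)
    z = sample_latent(mu, log_var, rng)
    delta = [0.8, -2.0]  # would come from the RL policy during finetuning
    z_steered = steer_latent(z, delta, scale=0.1)
    print(frozen_action_generator(z_steered))
```

In this sketch a zero residual leaves the latent unchanged, so the frozen base policy's behavior is recovered exactly. Exploration then takes place in the low-dimensional latent rather than directly on output actions.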
