Dexterous manipulation is difficult for imitation learning because of its high-dimensional action space and complex, contact-rich dynamics. Policies trained purely from demonstrations often suffer from compounding errors during deployment and require large amounts of expert data to achieve reliable performance. To move beyond the limitations of demonstration data, we propose DexPIE, a post-training framework that improves dexterous policies using experience collected during real-world deployment. First, DexPIE achieves effective exploration coverage through an intervention system adapted to dexterous hands and multi-stage DAgger-style data collection across initial and intermediate task stages, providing reliable supervision for accurate policy evaluation. Second, to reduce temporal noise between post-training rollouts and demonstration data, we introduce asynchronous inference in the relative action space, which better aligns rollout data with demonstrated behavior and allows the critic to learn a value function induced by a more consistent underlying policy. Finally, DexPIE improves the policy by conditioning it on a continuous optimality indicator, allowing the policy to exploit data quality in a more fine-grained manner. Across three challenging real-world dexterous manipulation tasks, DexPIE achieves a 37% improvement in success rate over the demonstration-based reference policy, outperforming all baseline methods and demonstrating stronger robustness. The source code and dataset will be made publicly available.
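To make the idea of a relative action space concrete, the sketch below shows one common reading of it: each target in a predicted action chunk is stored as an offset from the robot state observed at inference time, and the chunk is re-anchored to the most recent state when it is executed. The abstract does not specify DexPIE's actual formulation, so the function names (`to_relative`, `to_absolute`), the two-joint state, and the numeric values are all illustrative assumptions rather than the authors' implementation.

```python
from typing import List

Vector = List[float]


def to_relative(chunk_abs: List[Vector], ref_state: Vector) -> List[Vector]:
    """Express each absolute target in the chunk as an offset from ref_state."""
    return [[a - r for a, r in zip(step, ref_state)] for step in chunk_abs]


def to_absolute(chunk_rel: List[Vector], current_state: Vector) -> List[Vector]:
    """Re-anchor a relative chunk to the state observed at execution time."""
    return [[d + c for d, c in zip(step, current_state)] for step in chunk_rel]


# Hypothetical two-joint example (values are made up for illustration).
state_at_inference = [0.10, 0.50]          # state the policy observed
chunk_abs = [[0.12, 0.55], [0.15, 0.60]]   # absolute targets it predicted

chunk_rel = to_relative(chunk_abs, state_at_inference)

# With asynchronous inference, the robot may have moved while the policy
# was computing; the offsets are applied to the latest state instead.
latest_state = [0.11, 0.52]
executed = to_absolute(chunk_rel, latest_state)
```

Under this assumed formulation, the offsets in a chunk do not depend on where the robot happened to be when inference finished, which is one plausible way rollout data could stay closer to demonstrated behavior despite inference latency.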