Heterogeneous Multi-Expert Reinforcement Learning for Long-Horizon Multi-Goal Tasks in Autonomous Forklifts

Autonomous mobile manipulation in unstructured warehouses requires balancing efficient large-scale navigation with high-precision object interaction. Traditional end-to-end learning approaches often struggle with the conflicting demands of these two phases: navigation relies on robust decision-making over large spaces, while manipulation needs high sensitivity to fine local details. Forcing a single network to learn both objectives simultaneously often causes optimization interference, where improving one task degrades the other. To address these limitations, we propose a Heterogeneous Multi-Expert Reinforcement Learning (HMER) framework tailored for autonomous forklifts. HMER decomposes long-horizon tasks into specialized sub-policies controlled by a Semantic Task Planner. This structure separates macro-level navigation from micro-level manipulation, allowing each expert to focus on its own action space without interference. The planner coordinates the sequential execution of these experts, bridging the gap between task planning and continuous control. Furthermore, to address sparse exploration, we introduce a Hybrid Imitation-Reinforcement Training Strategy, which uses expert demonstrations to initialize the policy and reinforcement learning to fine-tune it. Experiments in Gazebo simulation show that HMER significantly outperforms sequential and end-to-end baselines. Our method achieves a task success rate of 94.2% (compared to 62.5% for baselines), reduces operation time by 21.4%, and maintains placement error within 1.5 cm, validating its efficacy for precise material handling.
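The sequential coordination of experts described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Expert` interface, the `run_plan` function, the two-variable toy state, and the hand-written controllers standing in for learned policies are all hypothetical, chosen only to show a planner handing control from a coarse navigation expert to a fine manipulation expert.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Observation = Dict[str, float]
Action = List[float]


@dataclass
class Expert:
    """One specialized sub-policy plus its own completion test (hypothetical interface)."""
    name: str
    policy: Callable[[Observation], Action]
    is_done: Callable[[Observation], bool]


def run_plan(
    plan: Sequence[str],
    experts: Dict[str, Expert],
    obs: Observation,
    step: Callable[[Observation, Action], Observation],
    max_steps_per_expert: int = 200,
) -> Tuple[List[str], Observation]:
    """Execute experts one after another in plan order.

    Returns the names of the experts that completed and the final observation.
    Stops early if an expert exhausts its step budget.
    """
    finished: List[str] = []
    for name in plan:
        expert = experts[name]
        steps = 0
        while not expert.is_done(obs):
            if steps >= max_steps_per_expert:
                return finished, obs
            obs = step(obs, expert.policy(obs))
            steps += 1
        finished.append(name)
    return finished, obs


def clip(value: float, limit: float) -> float:
    """Clamp value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def toy_step(obs: Observation, action: Action) -> Observation:
    """Toy dynamics (assumption): the action is a pair of increments (dx, dfork)."""
    return {"x": obs["x"] + action[0], "fork": obs["fork"] + action[1]}


def demo() -> Tuple[List[str], Observation]:
    # Hand-written controllers stand in for trained policies.
    navigate = Expert(
        name="navigate",
        policy=lambda o: [clip(10.0 - o["x"], 1.0), 0.0],      # coarse steps
        is_done=lambda o: abs(o["x"] - 10.0) < 1e-6,
    )
    manipulate = Expert(
        name="manipulate",
        policy=lambda o: [0.0, clip(0.5 - o["fork"], 0.05)],   # fine steps
        is_done=lambda o: abs(o["fork"] - 0.5) < 1e-6,
    )
    experts = {e.name: e for e in (navigate, manipulate)}
    start = {"x": 0.0, "fork": 0.0}
    return run_plan(["navigate", "manipulate"], experts, start, toy_step)


if __name__ == "__main__":
    print(demo())
```

Each expert in this sketch only emits actions along its own axis, which mirrors the idea of separate action spaces. In a learned system the lambdas would be replaced by trained policies and the plan would come from the planner rather than a fixed list.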
