High-level robot skills represent an increasingly popular paradigm in robot programming. However, configuring a skill's parameters for a specific task remains a manual and time-consuming endeavor. Existing approaches for learning or optimizing these parameters often require numerous real-world executions or do not work in dynamic environments. To address these challenges, we propose MuTT, a novel encoder-decoder transformer architecture designed to predict environment-aware executions of robot skills by integrating vision, trajectory, and robot skill parameters. Notably, we pioneer the fusion of vision and trajectory, introducing a novel trajectory projection. Furthermore, we demonstrate MuTT's efficacy as a predictor when paired with a model-based robot skill optimizer: this enables optimizing robot skill parameters for the current environment without real-world executions during optimization. Designed for compatibility with any representation of robot skills, MuTT proves its versatility in three comprehensive experiments, showing superior performance across two different skill representations.
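To make the two ideas in the abstract concrete, the following is a minimal conceptual sketch, not the authors' implementation: it fuses image patches, a planned trajectory, and skill parameters into one token sequence for an encoder-decoder transformer that predicts the executed trajectory, and then uses that learned predictor to score candidate skill parameters with simple random search instead of real-world executions. All module names, dimensions, the fixed-length decoder queries, and the goal-distance cost are illustrative assumptions.

```python
# Conceptual sketch (assumptions throughout, not the MuTT implementation):
# (1) a multimodal encoder-decoder transformer fusing vision, trajectory,
#     and skill-parameter tokens to predict an execution trajectory;
# (2) a model-based optimizer that searches skill parameters against the
#     learned predictor, so no real robot executions are needed.
import torch
import torch.nn as nn

D = 128  # shared embedding width (assumption)

class MultimodalExecutionPredictor(nn.Module):
    """Fuses vision, planned trajectory, and skill parameters; decodes a trajectory."""
    def __init__(self, n_patches=16, patch_dim=48, traj_dim=7, param_dim=4, horizon=32):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, D)   # image patches -> tokens
        self.traj_embed = nn.Linear(traj_dim, D)     # planned trajectory points -> tokens
        self.param_embed = nn.Linear(param_dim, D)   # skill parameters -> one token
        self.queries = nn.Parameter(torch.randn(horizon, D))  # learned decoder queries
        self.transformer = nn.Transformer(
            d_model=D, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.head = nn.Linear(D, traj_dim)           # token -> predicted trajectory point

    def forward(self, patches, planned_traj, params):
        # Concatenate all modalities into a single encoder token sequence.
        tokens = torch.cat([
            self.patch_embed(patches),               # (B, n_patches, D)
            self.traj_embed(planned_traj),           # (B, T, D)
            self.param_embed(params).unsqueeze(1),   # (B, 1, D)
        ], dim=1)
        queries = self.queries.unsqueeze(0).expand(patches.size(0), -1, -1)
        decoded = self.transformer(tokens, queries)  # (B, horizon, D)
        return self.head(decoded)                    # predicted execution trajectory

def optimize_skill_params(model, patches, planned_traj, goal, n_candidates=256):
    """Random search over skill parameters scored by the predictor (no real executions)."""
    model.eval()
    with torch.no_grad():
        candidates = torch.rand(n_candidates, 4)              # (N, param_dim) in [0, 1)
        pred = model(patches.expand(n_candidates, -1, -1),
                     planned_traj.expand(n_candidates, -1, -1),
                     candidates)                              # (N, horizon, traj_dim)
        cost = (pred[:, -1, :] - goal).norm(dim=-1)           # final-point distance to goal
        return candidates[cost.argmin()]

# Usage against a synthetic scene (stand-in tensors).
model = MultimodalExecutionPredictor()
patches = torch.randn(1, 16, 48)   # stand-in for encoded camera image
planned = torch.randn(1, 20, 7)    # stand-in for the planned skill trajectory
best = optimize_skill_params(model, patches, planned, goal=torch.zeros(7))
print("best parameters:", best)
```

The sketch decodes a fixed number of learned query tokens, one per predicted trajectory point; the paper's trajectory projection and decoding scheme may differ, and a stronger optimizer (e.g., CMA-ES or gradient-based search through the predictor) could replace the random search shown here.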