The performance of robotic imitation learning is fundamentally limited by data quality and training strategy. Prevalent sampling strategies on RLBench suffer from severe keyframe redundancy and an imbalanced temporal distribution, leading to inefficient memory usage and unstable optimization. Moreover, reprojecting point clouds onto multi-view images with a black background, while more efficient than voxel-based methods, often makes dark objects indistinguishable from the background and therefore hard to manipulate. In this work, we propose a holistic framework that significantly improves both model performance and training efficiency. First, we redesign the keyframe sampling strategy, reducing memory consumption by 80% and speeding up training by 5x. Second, we augment the model with a color inversion projection branch, a simple yet effective module that resolves the ambiguity of dark objects. Finally, we propose a task-guided mixup technique that dynamically fuses point clouds and action heatmaps according to task instructions, greatly improving robustness to distractors and performance in multi-goal scenarios. Extensive experiments demonstrate that our method achieves state-of-the-art performance, with a 90.5% success rate on RLBench and 68.8% on the COLOSSEUM benchmark under challenging interference conditions. Our code and checkpoints are available at https://github.com/PuFanqi23/TGM-VLA.
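To make the idea behind the color inversion projection branch concrete, the following is a minimal sketch and not the released implementation. It assumes a rendered view stored as a float array in [0, 1] plus a boolean mask marking pixels that received a projected point; the function name `invert_projection` and both input conventions are hypothetical. Inverting only the foreground keeps the background black while turning near-black object pixels bright.

```python
import numpy as np


def invert_projection(image, foreground_mask):
    """Return a copy of `image` with foreground colors inverted.

    image: float array of shape (H, W, 3), values in [0, 1] (assumed layout).
    foreground_mask: bool array of shape (H, W), True where a point was projected.
    The background stays black so only object pixels change.
    """
    image = np.asarray(image, dtype=np.float32)
    inverted = np.zeros_like(image)
    inverted[foreground_mask] = 1.0 - image[foreground_mask]
    return inverted


# Toy 2x2 view: one near-black object pixel, one bright object pixel.
view = np.zeros((2, 2, 3), dtype=np.float32)
mask = np.array([[True, False], [False, True]])
view[0, 0] = [0.05, 0.05, 0.05]  # dark object, hard to see on black
view[1, 1] = [0.9, 0.8, 0.7]     # bright object

inv = invert_projection(view, mask)
```

In this toy view the dark pixel becomes nearly white after inversion, while the untouched background remains zero, which is the contrast the abstract attributes to the extra branch. How the inverted view is fused with the original view inside the model is not specified here and is left out of the sketch.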