As model and dataset sizes continue to scale in robot learning, understanding which specific factors in a dataset affect model performance becomes increasingly urgent to keep data collection cost-effective while maintaining model performance. In this work, we empirically investigate how physics attributes (color, friction coefficient, shape) and scene background characteristics, such as the complexity and dynamics of interactions with background objects, influence the performance of Video Transformers in predicting planar pushing trajectories. We investigate three primary questions: How do physics attributes and background scene characteristics influence model performance? Which changes in attributes are most detrimental to model generalization? What proportion of fine-tuning data is required to adapt models to novel scenarios? To facilitate this research, we present CloudGripper-Push-1K, a large real-world vision-based robot pushing dataset comprising 1278 hours and 460,000 videos of planar pushing interactions with objects that differ in physics and background attributes. We also propose the Video Occlusion Transformer (VOT), a generic, modular video-transformer-based trajectory prediction framework that features three choices of 2D spatial encoder as the subject of our case study. The dataset and code will be available at https://cloudgripper.org.