As transformer architectures and datasets continue to scale, it becomes increasingly urgent to understand which dataset factors affect model performance. This paper investigates how object physics attributes (color, friction coefficient, shape) and background characteristics (static, dynamic, background complexity) influence the performance of Video Transformers on trajectory prediction under occlusion. Beyond the occlusion challenge itself, we address three questions: (1) How do object physics attributes and background characteristics influence model performance? (2) Which attributes most influence model generalization? (3) Is there a data saturation point for large transformer models within a single task? To support this research, we present OccluManip, a real-world, video-based robot pushing dataset comprising 460,000 consistent recordings of objects with different physics attributes against varying backgrounds. The dataset totals 1.4 TB and 1278 hours of high-quality video of flexible temporal length, together with target object trajectories, and thus accommodates tasks with different temporal requirements. Additionally, we propose the Video Occlusion Transformer (VOT), a generic video-transformer-based network that achieves an average accuracy of 96% across all 18 sub-datasets in OccluManip. OccluManip and VOT will be released at: https://github.com/ShutongJIN/OccluManip.git