The availability of a large labeled dataset is a key requirement for applying deep learning methods to computer vision tasks. For understanding human activities, existing public datasets, while large, are often limited to a single RGB camera and provide only per-frame or per-clip action annotations. To enable richer analysis and understanding of human activities, we introduce IKEA ASM, a three-million-frame, multi-view furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose. Additionally, we benchmark prominent methods for video action recognition, object segmentation, and human pose estimation on this challenging dataset. The dataset enables the development of holistic methods that integrate multi-modal and multi-view data to perform better on these tasks.