Action understanding is a vital step toward intelligent agents and has attracted long-term attention. It can be formulated as a mapping from the physical space of actions to a semantic space. Typically, researchers build action datasets with idiosyncratic class definitions and push the envelope on each benchmark separately. As a result, datasets are incompatible with each other, like "isolated islands", because of semantic gaps and differing class granularities, e.g., "do housework" in dataset A versus "wash plate" in dataset B. We argue that a more principled semantic space is urgently needed to concentrate the community's efforts and to let all datasets be used together in pursuit of generalizable action learning. To this end, we design a structured action semantic space based on a verb taxonomy hierarchy that covers a large number of actions. By aligning the classes of previous datasets to this semantic space, we gather image, video, skeleton, and MoCap datasets into a single database under a unified label system, i.e., we bridge the "isolated islands" into a "Pangea". Accordingly, we propose a novel model that maps from the physical space to the semantic space in order to make full use of Pangea. In extensive experiments, our system shows significant advantages, especially in transfer learning. Code and data will be made publicly available.
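To make the alignment idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy verb hierarchy stored as child-to-parent links and a hand-written alignment table; the names `PARENT`, `ALIGNMENT`, `ancestors_and_self`, and `to_semantic_labels`, as well as the specific verb nodes, are hypothetical and chosen only for illustration. It shows how classes of different granularity from two datasets could end up sharing labels once each class is expanded along the hierarchy.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy verb hierarchy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "act": None,
    "clean": "act",
    "wash": "clean",
    "sweep": "clean",
}

# Hypothetical alignment from (dataset, class name) to hierarchy nodes.
ALIGNMENT: Dict[Tuple[str, str], List[str]] = {
    ("A", "do housework"): ["clean"],  # coarse-grained class
    ("B", "wash plate"): ["wash"],     # fine-grained class
}


def ancestors_and_self(node: str) -> List[str]:
    """Return the node followed by all of its ancestors up to the root."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def to_semantic_labels(dataset: str, class_name: str) -> Set[str]:
    """Expand a dataset-specific class into the set of hierarchy nodes it implies."""
    labels: Set[str] = set()
    for node in ALIGNMENT[(dataset, class_name)]:
        labels.update(ancestors_and_self(node))
    return labels


if __name__ == "__main__":
    a = to_semantic_labels("A", "do housework")
    b = to_semantic_labels("B", "wash plate")
    # Nodes implied by both classes, despite their different granularity.
    print(sorted(a & b))
```

Under these assumptions, the two classes from the example above share the nodes `clean` and `act`, which is the sense in which a common hierarchy lets samples from different datasets supervise the same labels.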