Action understanding has long attracted attention. It can be formulated as a mapping from the physical space to the semantic space. Typically, researchers have built datasets with idiosyncratic class definitions, and each dataset has pushed the envelope of its own benchmark. As a result, datasets are incompatible with each other, like "isolated islands", because of semantic gaps and differing class granularities, e.g., "do housework" in dataset A versus "wash plate" in dataset B. We argue that a more principled semantic space is needed to concentrate community efforts and to use all datasets together in pursuit of generalizable action learning. To this end, we design a structured action semantic space based on a verb taxonomy hierarchy and covering a massive number of actions. By aligning the classes of previous datasets to our semantic space, we gather image, video, skeleton, and MoCap datasets into a unified database under a unified label system, i.e., bridging the "isolated islands" into a "Pangea". Accordingly, we propose a novel model that maps from the physical space to the semantic space to fully exploit Pangea. In extensive experiments, our system shows significant superiority, especially in transfer learning. Our code and data will be made public at https://mvig-rhos.com/pangea.
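To make the idea of aligning dataset classes to a hierarchical semantic space concrete, the following is a minimal sketch and not the authors' implementation. The toy hierarchy `PARENT`, the table `ALIGNMENT`, and the helper names are all hypothetical; only the "do housework" and "wash plate" classes come from the text above. It assumes each dataset class is aligned to one node of a verb hierarchy and inherits that node's ancestors as additional labels.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy verb hierarchy: child node -> parent node (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "act": None,
    "do_housework": "act",
    "wash": "do_housework",
    "wash_plate": "wash",
    "sweep_floor": "do_housework",
}

# Hypothetical alignment of (dataset, class name) pairs to hierarchy nodes.
ALIGNMENT: Dict[Tuple[str, str], str] = {
    ("A", "do housework"): "do_housework",
    ("B", "wash plate"): "wash_plate",
}


def ancestors(node: str) -> List[str]:
    """Return the path from a node up to the root, including the node itself."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def unified_labels(dataset: str, class_name: str) -> List[str]:
    """Map a dataset-specific class to its node plus all coarser ancestor nodes."""
    return ancestors(ALIGNMENT[(dataset, class_name)])


def shared_nodes(a: Tuple[str, str], b: Tuple[str, str]) -> List[str]:
    """Return the nodes that two dataset classes share, from finest to coarsest."""
    labels_b = set(unified_labels(*b))
    return [n for n in unified_labels(*a) if n in labels_b]


if __name__ == "__main__":
    print(unified_labels("B", "wash plate"))
    print(shared_nodes(("A", "do housework"), ("B", "wash plate")))
```

Under these assumptions, a coarse class in one dataset and a fine class in another stop being unrelated labels: they meet at a shared node of the hierarchy, which is what allows samples from both datasets to supervise the same label system.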