Action understanding is an important problem that has attracted considerable attention. It can be formulated as a mapping from the physical space of actions to a semantic space. Typically, researchers have built action datasets with idiosyncratic class definitions, each pushing the envelope of its own benchmark. As a result, datasets are mutually incompatible, like "isolated islands", because of semantic gaps and differing class granularities, e.g., "do housework" in dataset A versus "wash plate" in dataset B. We argue that a more principled semantic space is urgently needed to concentrate community efforts and to allow all datasets to be used together in pursuit of generalizable action learning. To this end, we design a Poincaré action semantic space that is built on a verb taxonomy hierarchy and covers a massive number of actions. By aligning the classes of previous datasets to our semantic space, we gather image, video, skeleton, and MoCap datasets into a unified database under a unified label system, i.e., bridging the "isolated islands" into a "Pangea". Accordingly, we propose a bidirectional mapping model between the physical and semantic spaces to make full use of Pangea. In extensive experiments, our system shows significant superiority, especially in transfer learning. Code and data will be made publicly available.
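To illustrate why a Poincaré (hyperbolic) space suits a verb hierarchy, the following minimal sketch computes the standard geodesic distance in the Poincaré ball model. The abstract does not specify the metric, dimensionality, or embedding procedure, so the function name `poincare_distance` and the 2-D coordinates below are hypothetical and chosen only for illustration: a coarse class sits near the origin and finer classes sit near the boundary.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return math.acosh(arg)


# Hypothetical embeddings (assumption, not taken from the paper):
# the coarse class lies near the origin, fine-grained classes near the boundary.
embeddings = {
    "do housework": (0.10, 0.00),
    "wash plate": (0.80, 0.05),
    "mop floor": (0.05, 0.80),
}

parent_child = poincare_distance(embeddings["do housework"], embeddings["wash plate"])
siblings = poincare_distance(embeddings["wash plate"], embeddings["mop floor"])
print(f"parent-child: {parent_child:.3f}, siblings: {siblings:.3f}")
```

With these illustrative coordinates, the two fine-grained classes are farther from each other than either is from the coarse class, which mirrors a tree in which siblings are connected only through their parent.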