Human activity understanding is of widespread interest in artificial intelligence and spans diverse applications such as health care and behavior analysis. Despite advances in deep learning, it remains challenging. Solutions modeled on object recognition usually try to map pixels to semantics directly, but activity patterns differ substantially from object patterns, which hinders their success. In this work, we propose a novel paradigm that reformulates the task in two stages: first mapping pixels to an intermediate space spanned by atomic activity primitives, then programming the detected primitives with interpretable logic rules to infer semantics. To afford a representative primitive space, we build a knowledge base that includes more than 26 million primitive labels, together with logic rules derived from human priors or automatic discovery. Our framework, the Human Activity Knowledge Engine (HAKE), exhibits superior generalization ability and performance over canonical methods on challenging benchmarks. Code and data are available at http://hake-mvig.cn/.
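To make the two-stage idea concrete, the following is a minimal sketch of the second stage only: combining primitive confidences with logic rules to score activities. It is an illustration under stated assumptions, not HAKE's actual implementation. The primitive names, the rules, the `infer_activities` helper, and the use of min/max fuzzy logic for AND/OR are all hypothetical choices made for this example; the first stage (a detector mapping pixels to primitive scores) is replaced by a hard-coded dictionary.

```python
from typing import Callable, Dict

# Stage 1 stand-in: confidences in [0, 1] for atomic primitives.
# In a real system these would come from a learned detector; here they
# are hypothetical values for a single detected person.
PrimitiveScores = Dict[str, float]

# Fuzzy-logic connectives (an assumption of this sketch, not the paper's method).
def AND(*xs: float) -> float:
    return min(xs)

def OR(*xs: float) -> float:
    return max(xs)

def NOT(x: float) -> float:
    return 1.0 - x

# Stage 2: interpretable rules that map primitives to activity scores.
# Both the rule set and the primitive names are illustrative only.
RULES: Dict[str, Callable[[PrimitiveScores], float]] = {
    "ride_bicycle": lambda p: AND(p["hip_sit_on"], p["hand_hold"], p["foot_step_on"]),
    "drink": lambda p: AND(p["hand_hold"], p["head_drink_with"]),
    "walk": lambda p: AND(p["foot_walk"], NOT(p["hip_sit_on"])),
}

def infer_activities(
    primitives: PrimitiveScores,
    rules: Dict[str, Callable[[PrimitiveScores], float]] = RULES,
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Score every rule and keep the activities at or above the threshold."""
    scores = {name: rule(primitives) for name, rule in rules.items()}
    return {name: s for name, s in scores.items() if s >= threshold}

# Hypothetical detector output for one person in one image.
example: PrimitiveScores = {
    "hip_sit_on": 0.9,
    "hand_hold": 0.8,
    "foot_step_on": 0.7,
    "head_drink_with": 0.1,
    "foot_walk": 0.2,
}

if __name__ == "__main__":
    print(infer_activities(example))
```

Because each activity score traces back to named primitives and an explicit rule, a wrong prediction can be attributed either to a misdetected primitive or to a faulty rule, which is the kind of interpretability the two-stage formulation aims for.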