Human affordance learning investigates contextually relevant novel pose prediction, such that the estimated pose represents a valid human action within the scene. While the task is fundamental to machine perception and automated interactive navigation agents, the exponentially large number of probable pose and action variations makes the problem challenging. Moreover, existing datasets and methods for human affordance prediction in 2D scenes are significantly limited. In this paper, we propose a novel cross-attention mechanism that encodes the scene context for affordance prediction by mutually attending spatial feature maps from two different modalities. The proposed method decomposes the problem into individual subtasks to reduce its complexity. First, we sample a probable location for a person within the scene using a variational autoencoder (VAE) conditioned on the global scene context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding around the predicted location. Finally, we use two VAEs to sample the scale and deformation parameters for the predicted pose template, conditioned on the local context and the template class. Our experiments show significant improvements over the previous baseline for human affordance injection into complex 2D scenes.
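To make the idea of mutually attending two spatial feature maps concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that each feature map is flattened into a list of per-location vectors, that attention is plain scaled dot-product attention without the learned query, key, and value projections a trained model would normally include, and that each location's own feature is fused with its attended context by concatenation. The names `attend`, `mutual_cross_attention`, and `random_feature_map` are hypothetical, and the random inputs stand in for real encoder outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention without learned projections (assumption).

    Each query row is replaced by a weighted average of the rows in
    keys_values, which serve as both keys and values.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, keys_values))
            for j in range(dim)
        ])
    return attended


def mutual_cross_attention(feat_a, feat_b):
    """Let each modality attend to the other, then concatenate every
    location's own feature with the context gathered from the other map."""
    a_from_b = attend(feat_a, feat_b)  # modality A queries modality B
    b_from_a = attend(feat_b, feat_a)  # modality B queries modality A
    fused_a = [own + ctx for own, ctx in zip(feat_a, a_from_b)]
    fused_b = [own + ctx for own, ctx in zip(feat_b, b_from_a)]
    return fused_a, fused_b


def random_feature_map(height, width, channels, rng):
    """Hypothetical stand-in for an encoder output, flattened to H*W rows."""
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(channels)]
        for _ in range(height * width)
    ]


rng = random.Random(0)
feat_a = random_feature_map(4, 4, 8, rng)  # first modality, 16 locations
feat_b = random_feature_map(4, 4, 8, rng)  # second modality, 16 locations
fused_a, fused_b = mutual_cross_attention(feat_a, feat_b)
print(len(fused_a), len(fused_a[0]))
```

Concatenation doubles the channel count at every location, so the fused maps here have 16 channels. A trained model could instead sum the two parts or pass them through a further layer; the choice above only keeps the sketch short.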