We present a generative approach to forecasting long-term future human behavior in 3D that requires only weak supervision from readily available 2D human action data. This is a fundamental task enabling many downstream applications. The required ground-truth data is hard to capture in 3D (mocap suits, expensive setups) but easy to acquire in 2D (simple RGB cameras). We therefore design our method to require only 2D RGB data while still generating 3D human motion sequences. We apply a differentiable 2D projection scheme in an autoregressive manner for weak supervision, and an adversarial loss for 3D regularization. Our method predicts long and complex behavior sequences (e.g., cooking, assembly) consisting of multiple sub-actions. We tackle this in a semantically hierarchical manner, jointly predicting high-level coarse action labels together with their low-level fine-grained realizations as characteristic 3D human poses. We observe that these two action representations are inherently coupled, and that joint prediction benefits both action and pose forecasting. Our experiments demonstrate the complementary nature of joint action and 3D pose prediction: our joint approach outperforms each task treated individually, enables robust longer-term sequence prediction, and outperforms alternative approaches to forecasting actions and characteristic 3D poses.
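As an illustration of the weak-supervision idea, the following is a minimal sketch of a 2D reprojection loss: a predicted 3D pose is projected onto the image plane and compared with observed 2D joints. The abstract does not specify the camera model or the loss, so the pinhole projection, the mean squared error, and all names here (`project_pinhole`, `reprojection_loss`, `focal`, `cx`, `cy`) are assumptions made for illustration, not the method's actual implementation. In practice this would be written in an automatic-differentiation framework so that gradients flow back to the 3D prediction.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_pinhole(
    joints_3d: Sequence[Point3D],
    focal: float = 1.0,
    cx: float = 0.0,
    cy: float = 0.0,
) -> List[Point2D]:
    """Perspective-project camera-space 3D joints onto the image plane.

    Assumes a simple pinhole camera (hypothetical choice, not from the paper).
    """
    projected = []
    for x, y, z in joints_3d:
        if z <= 0:
            raise ValueError("joint must lie in front of the camera (z > 0)")
        projected.append((focal * x / z + cx, focal * y / z + cy))
    return projected


def reprojection_loss(
    joints_3d: Sequence[Point3D],
    joints_2d: Sequence[Point2D],
    focal: float = 1.0,
) -> float:
    """Mean squared 2D distance between projected 3D joints and 2D observations."""
    projected = project_pinhole(joints_3d, focal)
    if len(projected) != len(joints_2d):
        raise ValueError("3D and 2D joint counts must match")
    total = 0.0
    for (u, v), (u_obs, v_obs) in zip(projected, joints_2d):
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total / len(projected)


# Toy two-joint pose and its 2D observation.
pose = [(0.0, 0.0, 2.0), (1.0, -1.0, 4.0)]
observed = project_pinhole(pose)

# The same pose scaled along the camera rays projects to identical 2D points.
scaled_pose = [(0.0, 0.0, 4.0), (2.0, -2.0, 8.0)]

loss_exact = reprojection_loss(pose, observed)
loss_scaled = reprojection_loss(scaled_pose, observed)
loss_shifted = reprojection_loss([(0.2, 0.0, 2.0), (1.0, -1.0, 4.0)], observed)
```

In this toy case `loss_exact` and `loss_scaled` are both zero even though the two 3D poses differ, which shows that 2D supervision alone leaves depth unconstrained. That ambiguity is one reason a 3D regularizer, such as the adversarial loss mentioned in the abstract, is a natural complement.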