Existing methods for multi-person video 3D human Pose and Shape Estimation (PSE) typically adopt a two-stage strategy, which first detects human instances in each frame and then performs single-person PSE with a temporal model. However, this strategy cannot capture the global spatio-temporal context among spatial instances. In this paper, we propose a new end-to-end multi-person 3D Pose and Shape estimation framework with a progressive Video Transformer, termed PSVT. In PSVT, a spatio-temporal encoder (STE) captures the global feature dependencies among spatial objects. Then, a spatio-temporal pose decoder (STPD) and a spatio-temporal shape decoder (STSD) capture the global dependencies between pose queries and feature tokens and between shape queries and feature tokens, respectively. To handle the variations of objects over time, a novel progressive decoding scheme updates the pose and shape queries at each frame. In addition, we propose a novel pose-guided attention (PGA) mechanism for the shape decoder to better predict shape parameters. These two components strengthen the decoder of PSVT and improve performance. Extensive experiments on four datasets show that PSVT achieves state-of-the-art results.
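The idea of progressive decoding, in which queries are updated at each frame and carried forward to the next, can be illustrated with a minimal sketch. This is not the PSVT implementation: it assumes single-head dot-product attention with no learned projections, a plain residual update, and hypothetical names (`cross_attention`, `progressive_decode`) chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, tokens):
    """Attend each query to the feature tokens of one frame.

    Assumption: scaled dot-product attention without learned
    query/key/value projections, to keep the sketch short.
    """
    dim = len(tokens[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim)
            for t in tokens
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return attended


def progressive_decode(frame_tokens, init_queries):
    """Decode frame by frame, passing the updated queries forward.

    Assumption: the per-frame update is a simple residual addition.
    Returns the list of query states, one entry per frame.
    """
    queries = init_queries
    states = []
    for tokens in frame_tokens:
        attended = cross_attention(queries, tokens)
        # The queries for the next frame start from this frame's result.
        queries = [
            [q + a for q, a in zip(q_vec, a_vec)]
            for q_vec, a_vec in zip(queries, attended)
        ]
        states.append(queries)
    return states


# Usage: two frames of 2-D feature tokens and one query.
frames = [
    [[1.0, 0.0], [1.0, 0.0]],  # frame 1: two identical tokens
    [[0.0, 2.0]],              # frame 2: a single token
]
states = progressive_decode(frames, [[0.0, 0.0]])
```

Because the second frame starts from the queries produced by the first, each frame's output depends on everything seen so far, which is the property the progressive scheme relies on.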