Transformers have been successfully applied to video-based 3D human pose estimation. However, the high computational cost of these video pose transformers (VPTs) makes them impractical on resource-constrained devices. In this paper, we present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose estimation from videos. HoT begins by pruning the pose tokens of redundant frames and ends by recovering full-length tokens, so that only a few pose tokens remain in the intermediate transformer blocks, which improves model efficiency. To achieve this effectively, we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames. In addition, we develop a token recovering attention (TRA) that restores detailed spatio-temporal information from the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that our method achieves both high efficiency and high estimation accuracy compared to the original VPT models. For instance, when applied to MotionBERT and MixSTE on Human3.6M, HoT saves nearly 50% of FLOPs without sacrificing accuracy and nearly 40% of FLOPs with only a 0.2% accuracy drop, respectively. Our source code will be open-sourced.
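The prune-then-recover idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: farthest-point sampling is swapped in as a simple stand-in for the token pruning cluster, a single-head softmax cross-attention with random (rather than learned) queries stands in for the token recovering attention, and the function names `prune_tokens` and `recover_tokens` as well as the sizes `num_frames`, `dim`, and `keep` are hypothetical choices made only for illustration.

```python
import numpy as np


def prune_tokens(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Return sorted indices of `keep` mutually distant frame tokens.

    Assumption: farthest-point sampling is used here as a simple stand-in
    for the token pruning cluster (TPC); it is not the paper's method.
    """
    num_frames = tokens.shape[0]
    if not 1 <= keep <= num_frames:
        raise ValueError("keep must be between 1 and the number of frames")
    first = num_frames // 2  # arbitrary starting frame
    selected = [first]
    # Distance from every token to its nearest already-selected token.
    min_dist = np.linalg.norm(tokens - tokens[first], axis=1)
    min_dist[first] = -np.inf  # never pick the same frame twice
    while len(selected) < keep:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist = np.linalg.norm(tokens - tokens[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist)
        min_dist[nxt] = -np.inf
    return np.sort(np.array(selected))


def recover_tokens(pruned: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Expand a few tokens back to full length with cross-attention.

    Assumption: one full-length query per frame attends to the kept tokens,
    which serve as both keys and values (a stand-in for TRA).
    """
    dim = pruned.shape[1]
    scores = queries @ pruned.T / np.sqrt(dim)    # (num_frames, keep)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over kept tokens
    return weights @ pruned                        # (num_frames, dim)


# Hypothetical sizes: 27 frames, 16 channels, 9 tokens kept.
rng = np.random.default_rng(0)
num_frames, dim, keep = 27, 16, 9
tokens = rng.standard_normal((num_frames, dim))

kept = prune_tokens(tokens, keep)
pruned = tokens[kept]  # the intermediate blocks would process only these
queries = rng.standard_normal((num_frames, dim))  # stand-in for learned queries
recovered = recover_tokens(pruned, queries)
```

In this sketch the intermediate sequence length drops from `num_frames` to `keep`, and the output returns to `num_frames` rows, which mirrors the hourglass shape the abstract describes. Any efficiency gain in a real model would depend on where the pruning and recovering steps are placed.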