Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose estimation from videos. Our HoT begins by pruning the pose tokens of redundant frames and ends by recovering full-length tokens, leaving only a few pose tokens in the intermediate transformer blocks and thus improving model efficiency. To achieve this effectively, we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames. In addition, we develop a token recovering attention (TRA) module that restores detailed spatio-temporal information from the selected tokens, expanding the network output to the original full-length temporal resolution for fast inference. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that our method achieves both high efficiency and high estimation accuracy compared to the original VPT models. For instance, when applied to MotionBERT and MixSTE on Human3.6M, our HoT saves nearly 50% of FLOPs without sacrificing accuracy and nearly 40% of FLOPs with only a 0.2% accuracy drop, respectively. Code and models are available at https://github.com/NationalGAILab/HoT.
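The prune-then-recover idea can be sketched in a few lines of NumPy. This is a hypothetical illustration, not the authors' implementation: it stands in for TPC with simple farthest-point sampling (a diversity-based selection; the actual TPC uses clustering of pose tokens), and stands in for TRA with a single cross-attention step from full-length query tokens to the pruned tokens. All function names, shapes, and the query initialization are assumptions for the sketch.

```python
import numpy as np

def prune_tokens(tokens, k):
    """Stand-in for TPC: keep k diverse frame tokens out of T.

    Uses farthest-point sampling as a simple proxy for selecting
    representative, semantically diverse tokens (hypothetical;
    the paper's TPC is a clustering-based selector).
    tokens: (T, C) array of per-frame pose tokens.
    Returns: (k, C) array of selected tokens, in temporal order.
    """
    selected = [0]
    dists = np.linalg.norm(tokens - tokens[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))          # farthest from current set
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(tokens - tokens[idx], axis=1))
    return tokens[np.sort(selected)]

def recover_tokens(pruned, queries):
    """Stand-in for TRA: restore full temporal length via cross-attention.

    T query tokens (e.g., learnable embeddings) attend over the k
    pruned tokens, producing T output tokens again.
    pruned: (k, C), queries: (T, C). Returns: (T, C).
    """
    scores = queries @ pruned.T / np.sqrt(queries.shape[1])  # (T, k)
    scores -= scores.max(axis=1, keepdims=True)              # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ pruned                                  # (T, C)

rng = np.random.default_rng(0)
T, C, k = 81, 16, 27                       # illustrative sizes
tokens = rng.normal(size=(T, C))           # tokens entering the middle blocks
pruned = prune_tokens(tokens, k)           # only k tokens processed mid-network
queries = rng.normal(size=(T, C))          # stand-in for learnable queries
recovered = recover_tokens(pruned, queries)  # back to full length T
```

The intermediate transformer blocks then operate on `k` tokens instead of `T`, which is where the FLOPs savings come from; the recovery step restores per-frame outputs so the estimator still predicts a pose for every input frame.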