In this paper, we introduce a set of simple yet effective TOken REduction (TORE) strategies for Transformer-based Human Mesh Recovery from monocular images. Current state-of-the-art performance is achieved by Transformer-based architectures, which nevertheless suffer from high model complexity and computational cost caused by redundant tokens. We propose token reduction strategies based on two important aspects, the 3D geometry structure and the 2D image features: we hierarchically recover the mesh geometry using priors from the body structure, and we cluster image feature tokens so that fewer but more discriminative tokens are passed to the Transformer. Our method substantially reduces the number of tokens involved in high-complexity interactions in the Transformer, which significantly lowers the computational cost while still achieving competitive or even higher accuracy in shape recovery. Extensive experiments across a wide range of benchmarks validate the effectiveness of the proposed method. We further demonstrate the generalizability of our method on hand mesh recovery. Visit our project page at https://frank-zy-dou.github.io/projects/Tore/index.html.
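To illustrate why reducing the number of tokens matters, the sketch below counts the query-key pairs that one full self-attention layer scores, which grows quadratically with the token count. It is only a back-of-the-envelope illustration: the function names and the token counts are hypothetical and are not taken from the paper or its experiments.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs scored by one full self-attention layer."""
    return num_tokens * num_tokens


def relative_cost(tokens_before: int, tokens_after: int) -> float:
    """Pairwise-interaction cost after reduction, as a fraction of the original."""
    return attention_pair_count(tokens_after) / attention_pair_count(tokens_before)


# Hypothetical token counts, chosen only for illustration (not from the paper).
tokens_before = 400
tokens_after = 100

print(relative_cost(tokens_before, tokens_after))  # 0.0625
```

Under these assumed counts, keeping a quarter of the tokens leaves one sixteenth of the pairwise interactions. This counts attention pairs only and ignores the per-token cost of the feed-forward layers, which scales linearly with the token count.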