Transformer architectures have achieved state-of-the-art (SOTA) performance on human mesh recovery (HMR) from monocular images. However, this performance gain has come at the cost of substantial memory and computational overhead. Real-world applications need a lightweight and efficient model that can still reconstruct an accurate human mesh. In this paper, we propose a pure transformer architecture named POoling aTtention TransformER (POTTER) for the HMR task from single images. Observing that the conventional attention module is expensive in both memory and computation, we propose an efficient pooling attention module, which significantly reduces the memory and computational cost without sacrificing performance. Furthermore, we design a new transformer architecture by integrating a High-Resolution (HR) stream for the HMR task. The high-resolution local and global features from the HR stream can be used to recover a more accurate human mesh. POTTER outperforms the SOTA method METRO on the Human3.6M (PA-MPJPE metric) and 3DPW (all three metrics) datasets while requiring only 7% of its total parameters and 14% of its Multiply-Accumulate Operations. The project webpage is https://zczcwh.github.io/potter_page.
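To make the cost argument concrete, the following is a minimal sketch of why a pooling-based token mixer can be cheaper than conventional attention. It is not POTTER's actual pooling attention module, whose design the abstract does not specify; it assumes a generic stride-1 average pool over a 2D token grid. The function names (`attention_macs`, `pooling_adds`, `avg_pool_mix`) and the token and channel sizes are hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: a generic pooling-based token mixer,
# NOT the pooling attention module proposed in the paper.


def attention_macs(num_tokens: int, dim: int) -> int:
    """MACs for the two N x N matrix products of single-head self-attention
    (Q @ K^T and scores @ V). Linear projections and softmax are excluded."""
    return 2 * num_tokens * num_tokens * dim


def pooling_adds(num_tokens: int, dim: int, window: int) -> int:
    """Upper bound on additions for a window x window average pool applied
    to every token and channel (border effects are ignored)."""
    return num_tokens * dim * window * window


def avg_pool_mix(grid, window=3):
    """Mix a 2D grid of scalar features with stride-1 average pooling.
    Border cells average only the neighbours that exist."""
    height, width = len(grid), len(grid[0])
    radius = window // 2
    mixed = []
    for r in range(height):
        row = []
        for c in range(width):
            neighbours = [
                grid[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            row.append(sum(neighbours) / len(neighbours))
        mixed.append(row)
    return mixed


if __name__ == "__main__":
    n, d = 56 * 56, 64  # hypothetical token count and channel width
    print("attention MACs:", attention_macs(n, d))
    print("pooling adds:  ", pooling_adds(n, d, window=3))
    print(avg_pool_mix([[1.0, 2.0], [3.0, 4.0]]))
```

The attention cost grows quadratically with the number of tokens, while the pooling cost grows linearly, which is the general motivation for replacing the attention map with pooling at high resolutions.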