We propose a one-stage framework for real-time multi-person 3D human mesh estimation from a single RGB image. While current one-stage methods, which follow a DETR-style pipeline, achieve state-of-the-art (SOTA) performance with high-resolution inputs, we observe that this particularly benefits the estimation of individuals appearing at smaller scales in the image (e.g., those far from the camera), but at the cost of significantly increased computational overhead. To address this, we introduce scale-adaptive tokens that are dynamically adjusted based on the relative scale of each individual in the image within the DETR framework. Specifically, individuals at smaller scales are processed at higher resolutions, larger ones at lower resolutions, and background regions are further distilled. These scale-adaptive tokens encode the image features more efficiently, facilitating subsequent decoding to regress the human mesh, while allowing the model to allocate computational resources more effectively and focus on more challenging cases. Experiments show that our method preserves the accuracy benefits of high-resolution processing while substantially reducing computational cost, achieving real-time inference with performance comparable to SOTA methods.
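The scale-adaptive token allocation described above can be illustrated with a minimal sketch. The stride values, area thresholds, and function names below are illustrative assumptions for exposition, not the paper's actual parameters: the key idea is that a person's token budget is driven by their relative scale, with small-scale (distant) individuals sampled densely and large-scale (close) individuals sampled coarsely.

```python
# Hypothetical sketch of scale-adaptive token allocation: small-scale people
# keep fine-grained patch tokens, large-scale people are subsampled, and
# background patches would be distilled away (omitted here for brevity).
# All thresholds and strides are assumptions, not values from the paper.

def token_stride_for_scale(bbox_area_frac: float) -> int:
    """Map a person's normalized bounding-box area to a patch-sampling stride.

    A smaller stride means a higher effective resolution (more tokens).
    """
    if bbox_area_frac < 0.05:    # far-away / small-scale person
        return 1                 # keep every patch token
    elif bbox_area_frac < 0.25:  # medium-scale person
        return 2                 # keep every 2nd patch along each axis
    else:                        # close / large-scale person
        return 4                 # keep every 4th patch along each axis

def num_tokens(region_patches: int, bbox_area_frac: float) -> int:
    """Token count for a person's region after scale-adaptive subsampling."""
    stride = token_stride_for_scale(bbox_area_frac)
    return max(1, region_patches // (stride * stride))
```

Under this toy policy, a distant person covering 64 patches keeps all 64 tokens, while a nearby person covering 1024 patches is reduced to 64 tokens, equalizing the per-person compute that the subsequent DETR-style decoder must handle.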