SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time use. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to reach interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL parameters) required by existing humanoid control and policy learning frameworks, we replace iterative mesh fitting with a direct feedforward mapping, accelerating this conversion step by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that, unlike methods reliant on wearable IMUs, enables real-time humanoid control and direct collection of manipulation demonstrations from a single RGB stream.
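The core of the mesh-to-SMPL speedup is replacing an iterative optimizer with a single learned forward pass. The following is a minimal, illustrative sketch of that idea in plain Python; the dimensions and names (`VERTEX_DIM`, `SMPL_PARAM_DIM`, `regress_smpl`) are toy stand-ins and not part of the actual 3DB or Fast SAM 3D Body implementation, which operates on the full 6890-vertex SMPL mesh and its 72 pose + 10 shape parameters.

```python
import random

# Toy dimensions (illustrative only; the real SMPL mesh has 6890 vertices x 3
# coordinates, and the parameter vector has 72 pose + 10 shape entries).
VERTEX_DIM = 12
SMPL_PARAM_DIM = 4

random.seed(0)
# A hypothetical pretrained linear map W: flattened vertices -> SMPL parameters.
# In the iterative baseline, these parameters would instead be found by running
# many optimizer steps per frame; here the fitting cost is paid once at
# training time, so inference is a single matrix-vector product.
W = [[random.gauss(0.0, 0.01) for _ in range(VERTEX_DIM)]
     for _ in range(SMPL_PARAM_DIM)]

def regress_smpl(vertices):
    """Direct feedforward mapping: one pass replaces iterative mesh fitting."""
    return [sum(w * v for w, v in zip(row, vertices)) for row in W]

verts = [0.1] * VERTEX_DIM        # flattened toy mesh
params = regress_smpl(verts)
print(len(params))                # 4
```

In practice such a regressor would be a small neural network trained offline against the iterative fitter's outputs; the sketch only illustrates why the per-frame cost collapses from many optimization iterations to one forward pass.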