Human mesh recovery from multi-view images faces a fundamental challenge: real-world datasets contain imperfect ground-truth annotations that bias model training, while synthetic data with precise supervision suffers from a domain gap. In this paper, we propose DiffProxy, a novel framework that generates multi-view consistent human proxies for mesh recovery. Central to DiffProxy is the use of diffusion-based generative priors to bridge synthetic training and real-world generalization. Its key innovations include: (1) a multi-conditional mechanism for generating multi-view consistent, pixel-aligned human proxies; (2) a hand refinement module that incorporates flexible visual prompts to enhance local details; and (3) an uncertainty-aware test-time scaling method that increases robustness to challenging cases during optimization. These designs ensure that the mesh recovery process effectively benefits from precise synthetic ground truth and the generative strengths of the diffusion-based pipeline. Trained entirely on synthetic data, DiffProxy achieves state-of-the-art performance across five real-world benchmarks, demonstrating strong zero-shot generalization, particularly in challenging scenarios with occlusions and partial views. Project page: https://wrk226.github.io/DiffProxy.html
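To make the uncertainty-aware test-time scaling idea concrete, below is a minimal sketch of one plausible realization: drawing several diffusion samples of the proxy per view, treating across-sample variance as an uncertainty map, and down-weighting uncertain regions during mesh fitting. All names here (`sample_proxies`, `uncertainty_weights`, `weighted_fitting_loss`, the `diffusion_model` callable) are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of uncertainty-aware test-time scaling for
# proxy-based mesh fitting; assumes diffusion_model(image) returns
# a proxy prediction (e.g., a dense keypoint/correspondence map).
import numpy as np

def sample_proxies(diffusion_model, image, n_samples=8):
    """Draw multiple stochastic proxy predictions for one view."""
    return np.stack([diffusion_model(image) for _ in range(n_samples)])

def uncertainty_weights(samples, eps=1e-6):
    """Per-pixel confidence from inverse across-sample variance:
    high disagreement between diffusion samples -> low weight."""
    var = samples.var(axis=0)
    return 1.0 / (var + eps)

def weighted_fitting_loss(mesh_proxy, samples):
    """Confidence-weighted L2 term for the mesh optimization step."""
    target = samples.mean(axis=0)          # consensus proxy
    w = uncertainty_weights(samples)       # per-pixel confidence
    return np.mean(w * (mesh_proxy - target) ** 2)
```

Under this interpretation, increasing the number of test-time samples both sharpens the consensus target and refines the uncertainty estimate, which is one way such scaling could improve robustness on occluded or partially visible inputs.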