3D human mesh recovery from monocular RGB images aims to estimate anatomically plausible 3D human models for downstream applications, but it remains challenging under partial or severe occlusion. Regression-based methods are efficient yet often produce implausible or inaccurate results in unconstrained scenarios, while diffusion-based methods provide strong generative priors for occluded regions but can lose fidelity on rare poses because they rely too heavily on generation. To address these limitations, we propose a brain-inspired synergistic framework that integrates the discriminative power of vision transformers (ViTs) with the generative capability of conditional diffusion models. Specifically, the ViT-based pathway extracts deterministic visual cues from visible regions, while the diffusion-based pathway synthesizes structurally coherent human body representations. To bridge the two pathways, we design a diverse-consistent feature learning module that aligns discriminative features with generative priors, and a cross-attention multi-level fusion mechanism that enables bidirectional interaction across semantic levels. Experiments on standard benchmarks demonstrate that our method achieves superior performance on key metrics and shows strong robustness in complex real-world scenarios.
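To make the idea of bidirectional cross-attention between the two pathways concrete, the following is a minimal sketch and not the authors' implementation. It assumes each pathway yields a list of token vectors per semantic level, omits the learned query/key/value projections and multi-head structure a real model would use, and adds a simple residual connection. All function and variable names (`cross_attend`, `bidirectional_fuse`, `fuse_levels`, `vit_tokens`, `diff_tokens`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: each query token attends over the context tokens.

    Assumption: no learned projections; tokens are used directly as
    queries, keys, and values. The attended vector is added back to the
    query as a residual.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        attended = [
            sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)
        ]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


def bidirectional_fuse(vit_tokens, diff_tokens):
    """One level of fusion: each pathway queries the other."""
    vit_out = cross_attend(vit_tokens, diff_tokens)
    diff_out = cross_attend(diff_tokens, vit_tokens)
    return vit_out, diff_out


def fuse_levels(vit_levels, diff_levels):
    """Apply bidirectional fusion independently at every semantic level."""
    return [bidirectional_fuse(v, d) for v, d in zip(vit_levels, diff_levels)]
```

In this sketch the token counts of the two pathways may differ at a given level, but the feature dimension must match; a real system would typically insert linear projections to reconcile dimensions before attention.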