Reconstructing animatable 3D humans from casually captured images of articulated subjects without camera or pose information is highly practical but remains challenging due to view misalignment, occlusions, and the absence of structural priors. In this work, we present LHM++, an efficient large-scale human reconstruction model that generates high-quality, animatable 3D avatars within seconds from one or more pose-free images. At its core is an Encoder-Decoder Point-Image Transformer architecture that progressively encodes and decodes 3D geometric point features to improve efficiency, while fusing hierarchical 3D point features with image features through multimodal attention. The fused features are decoded into 3D Gaussian splats to recover detailed geometry and appearance. To further enhance visual fidelity, we introduce a lightweight 3D-aware neural animation renderer that refines the rendering quality of reconstructed avatars in real time. Extensive experiments show that our method produces high-fidelity, animatable 3D humans without requiring camera or pose annotations. Our code and project page are available at https://lingtengqiu.github.io/LHM++/
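The multimodal attention fusion mentioned above can be sketched as follows: 3D point tokens act as queries that attend jointly over point and image tokens, yielding fused point features. This is a minimal single-head numpy illustration under assumed shapes and random weights; the names (`point_feats`, `image_feats`, `Wq` etc.) are illustrative and do not reflect the actual LHM++ implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Assumed shapes: N point tokens, M image tokens, feature dim d.
N, M, d = 64, 128, 32
point_feats = rng.standard_normal((N, d))   # 3D geometric point features
image_feats = rng.standard_normal((M, d))   # image patch features

# Hypothetical projection weights (learned in the real model).
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

# Multimodal attention: point queries attend over both modalities jointly,
# so each 3D point can draw on geometric context and image appearance.
tokens = np.concatenate([point_feats, image_feats], axis=0)  # (N + M, d)
Q = point_feats @ Wq                         # (N, d)
K = tokens @ Wk                              # (N + M, d)
V = tokens @ Wv                              # (N + M, d)
attn = softmax(Q @ K.T / np.sqrt(d))         # (N, N + M), rows sum to 1
fused = attn @ V                             # fused point features, (N, d)
print(fused.shape)  # (64, 32)
```

In the full architecture these fused features would then be decoded into 3D Gaussian splat parameters; here the sketch stops at the fusion step.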