Pixal3D: Pixel-Aligned 3D Generation from Images

Recent advances in 3D generative models have rapidly improved the quality of image-to-3D synthesis, enabling higher-resolution geometry and more realistic appearance. Yet fidelity, the pixel-level faithfulness of the generated 3D asset to the input image, remains a central bottleneck. We argue that this stems from an implicit 2D-3D correspondence: most 3D-native generators synthesize shape in a canonical space and inject image cues via attention, leaving pixel-to-3D associations ambiguous. To address this, we draw inspiration from 3D reconstruction and propose Pixal3D, a pixel-aligned 3D generation paradigm for high-fidelity 3D asset creation from images. Instead of generating in a canonical pose, Pixal3D generates 3D content directly in a pixel-aligned frame consistent with the input view. To enable this, we introduce a pixel back-projection conditioning scheme that explicitly lifts multi-scale image features into a 3D feature volume, establishing a direct, unambiguous pixel-to-3D correspondence. We show that Pixal3D is scalable, produces high-quality 3D assets, and substantially improves fidelity, approaching the fidelity of reconstruction. Furthermore, Pixal3D extends naturally to multi-view generation by aggregating back-projected feature volumes across views. Finally, we show that pixel-aligned generation benefits scene synthesis, and we present a modular pipeline that produces high-fidelity, object-separated 3D scenes from images. Pixal3D demonstrates, for the first time, 3D-native pixel-aligned generation at scale, and offers a new route to high-fidelity 3D generation of objects or scenes from single-view or multi-view images. Project page: https://ldyang694.github.io/projects/pixal3d/
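
To make the back-projection idea concrete, the following is a minimal sketch of lifting image features into a view-aligned 3D feature volume. It is not the authors' implementation: the pinhole camera with known intrinsics `K`, the camera-space voxel grid, the nearest-neighbor sampling, the approximate rescaling of intrinsics per feature scale, and the function names `back_project` and `lift_multiscale` are all assumptions made for illustration.

```python
import numpy as np

def back_project(feat, K, grid_res=8, depth_range=(1.0, 3.0), extent=1.0):
    """Lift a (C, H, W) feature map into a (C, D, D, D) camera-space volume.

    Assumed setup (hypothetical): pinhole intrinsics K, voxel centers laid
    out in camera coordinates (x right, y down, z forward), nearest-neighbor
    feature lookup.
    """
    C, H, W = feat.shape
    xs = np.linspace(-extent, extent, grid_res)
    ys = np.linspace(-extent, extent, grid_res)
    zs = np.linspace(depth_range[0], depth_range[1], grid_res)
    # Volume axes are ordered (z, y, x).
    z, y, x = np.meshgrid(zs, ys, xs, indexing="ij")

    # Project every voxel center onto the image plane.
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    ui = np.rint(u).astype(int)
    vi = np.rint(v).astype(int)

    # Voxels that project outside the image keep a zero feature.
    valid = (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
    vol = np.zeros((C, grid_res, grid_res, grid_res), dtype=feat.dtype)
    vol[:, valid] = feat[:, vi[valid], ui[valid]]
    return vol, valid

def lift_multiscale(feats, K, image_hw, **grid_kwargs):
    """Back-project feature maps of several resolutions and stack channels."""
    H, W = image_hw
    vols = []
    for f in feats:
        sy, sx = f.shape[1] / H, f.shape[2] / W
        # Rescale intrinsics to the feature map's resolution (approximate:
        # the half-pixel offset is ignored).
        K_s = np.diag([sx, sy, 1.0]) @ K
        vols.append(back_project(f, K_s, **grid_kwargs)[0])
    return np.concatenate(vols, axis=0)
```

Every voxel along a camera ray receives the feature of the pixel that ray passes through, so the pixel-to-3D association is fixed by geometry instead of being learned through attention. Under the same assumptions, a multi-view variant could transform each view's grid into a shared frame and combine the resulting volumes, for example by averaging over the views in which a voxel is visible; the aggregation rule here is likewise a guess, not something the abstract specifies.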
