Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

Weijie Wang, Qihang Cao, Sensen Gao, Donny Y. Chen, Haofei Xu, Wenjing Bian, Songyou Peng, Tat-Jen Cham, Chuanxia Zheng, Andreas Geiger, Jianfei Cai, Jia-Wang Bian, Bohan Zhuang

From arXiv. 67 pages, 395 references. Project page: https://ff3d-survey.github.io. Code: https://github.com/ziplab/Awesome-Feed-Forward-3D. This work has been submitted to Springer for possible publication.

Reconstructing 3D representations from 2D inputs is a fundamental task in computer vision and graphics, serving as a cornerstone for understanding and interacting with the physical world. While traditional methods achieve high fidelity, they are limited by slow per-scene optimization or category-specific training, which hinders their practical deployment and scalability. Hence, generalizable feed-forward 3D reconstruction has developed rapidly in recent years. By learning a model that maps images directly to 3D representations in a single forward pass, these methods enable efficient reconstruction and robust cross-scene generalization. Our survey is motivated by a critical observation: despite their diverse geometric output representations, ranging from implicit fields to explicit primitives, existing feed-forward approaches share similar high-level architectural patterns, such as image feature extraction backbones, multi-view information fusion mechanisms, and geometry-aware design principles. Consequently, we abstract away from these representation differences and propose a novel taxonomy centered on model design strategies that are agnostic to the output format. The taxonomy organizes the field into five key problems that drive recent research: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware models. To ground this taxonomy empirically and support standardized evaluation, we further review related benchmarks and datasets, and discuss and categorize real-world applications built on feed-forward 3D models. Finally, we outline future directions to address open challenges such as scalability, evaluation standards, and world modeling.
