We present FLARE, a feed-forward model that infers high-quality camera poses and 3D geometry from uncalibrated sparse-view images (as few as 2 to 8 inputs), a challenging yet practical setting in real-world applications. Our solution features a cascaded learning paradigm in which camera pose serves as the critical bridge, reflecting its essential role in mapping 3D structures onto 2D image planes. Concretely, FLARE starts with camera pose estimation, whose results condition the subsequent learning of geometric structure and appearance; both are optimized through the objectives of geometry reconstruction and novel-view synthesis. Trained on large-scale public datasets, our method delivers state-of-the-art performance in pose estimation, geometry reconstruction, and novel-view synthesis while keeping inference efficient (less than 0.5 seconds). The project page and code can be found at: https://zhanghe3z.github.io/FLARE/
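To make concrete the role of camera pose in mapping 3D structures onto 2D image planes, the following is a minimal sketch of the standard pinhole projection. It is not taken from the FLARE code: the function name `project_point`, the intrinsics `K`, and the pose `(R, t)` are hypothetical values chosen for illustration, and the world-to-camera convention `X_cam = R X_world + t` is an assumption.

```python
def project_point(point_world, K, R, t):
    """Project one 3D world point to pixel coordinates with a pinhole camera.

    Assumed convention (illustrative only): X_cam = R @ X_world + t.
    """
    # World frame -> camera frame using the pose (R, t).
    cam = [sum(R[i][j] * point_world[j] for j in range(3)) + t[i]
           for i in range(3)]
    # Camera frame -> homogeneous pixel coordinates using intrinsics K.
    uvw = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
    # Perspective divide by depth.
    return (uvw[0] / uvw[2], uvw[1] / uvw[2])


# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
# Identity pose: the camera sits at the world origin looking along +z.
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

center = project_point([0.0, 0.0, 2.0], K, R, t)
offset = project_point([0.5, -0.25, 2.0], K, R, t)
```

A point on the optical axis lands on the principal point, while an off-axis point shifts in proportion to its lateral offset divided by its depth. An error in the pose `(R, t)` therefore moves every projected point, which is one way to see why pose estimates are a natural quantity to condition geometry learning on.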