Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps that grow linearly with input count, while global-latent methods commit to a fixed, low-resolution output. We introduce Surflo, which compresses a variable number of unposed RGB views into K latent tokens that together form one global state, and decodes oriented 3D surface points by independently transporting each point from noise onto the surface via flow matching. This frees the output from any fixed grid or token budget: the same latent yields anywhere from a few thousand to a million points in a single forward pass. To suppress the local inconsistencies inherent to independent per-point decoding, an inference-time guidance term correlates nearby points by injecting a photometric gradient during ODE integration. Surflo matches or surpasses feed-forward baselines on surface metrics, runs an order of magnitude faster than optimization-based methods that require hundreds of views, and is the only feed-forward approach to combine a global latent with arbitrary-resolution decoding.
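As an illustration of the per-point decoding idea, the following is a minimal sketch and not Surflo's implementation. It assumes a closed-form velocity field for a sphere in place of the learned, latent-conditioned network; the names `toy_velocity` and `decode_points`, and the use of `center` and `radius` as a stand-in for the K latent tokens, are hypothetical. It omits the image encoder and the photometric guidance term. It shows only that each point is integrated independently, so the number of output points is a free choice at decoding time.

```python
import numpy as np


def toy_velocity(x, t, center, radius):
    """Stand-in for a learned velocity field v(x, t | latent).

    Assumption: the target surface is a sphere, so the velocity can be
    written in closed form. It points at the closest surface point and is
    scaled so that a straight-line path arrives at t = 1.
    """
    offset = x - center
    dist = np.linalg.norm(offset, axis=-1, keepdims=True)
    target = center + radius * offset / dist  # closest point on the sphere
    return (target - x) / (1.0 - t)


def decode_points(num_points, center, radius, num_steps=16, seed=0):
    """Transport Gaussian noise onto the surface with fixed-step Euler."""
    rng = np.random.default_rng(seed)
    x = center + rng.standard_normal((num_points, 3))  # noise samples at t = 0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t is never zero
        # Every row is updated independently of every other row.
        x = x + dt * toy_velocity(x, t, center, radius)
    offset = x - center
    normals = offset / np.linalg.norm(offset, axis=-1, keepdims=True)
    return x, normals


# The same "latent" (center, radius) decoded at two different resolutions.
coarse_pts, coarse_nrm = decode_points(2_000, np.zeros(3), 1.0)
dense_pts, dense_nrm = decode_points(200_000, np.zeros(3), 1.0)
```

Because no step couples one point to another, the cost scales linearly with the requested point count and the work can be batched freely. That same independence is what leaves room for the local inconsistencies that the guidance term is meant to suppress.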