Multiview 3D evaluation assumes that the images being scored are observations of one static 3D scene. This assumption can fail in novel view synthesis (NVS) and sparse-view reconstruction: inputs or generated outputs may contain artifacts, outlier frames, repeated views, or noise, yet still receive high 3D consistency scores. Existing reference-based metrics require ground truth, while ground-truth-free metrics such as MEt3R depend on learned reconstruction backbones whose failure modes are poorly characterized. We study this reliability problem by comparing neural reconstruction priors with classical geometric verification. We introduce \benchmark, a controlled robustness benchmark for multiview 3D consistency, and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components. This family recovers MEt3R as a special case and yields variants that are up to $3\times$ more robust. Our analysis shows that VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry and cross-view support for unrelated scenes, repeated images, and random noise. We further propose COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals. On real NVS outputs and in a structured human study, these metrics achieve up to $4\times$ higher correlation with human judgments than MEt3R.