Large-scale multi-view reconstruction models have made remarkable progress, but most existing approaches still rely on fully supervised training with ground-truth 3D/4D annotations. Such annotations are expensive and particularly scarce for dynamic scenes, which limits scalability. We propose SelfEvo, a self-improving framework that continually improves pretrained multi-view reconstruction models using unlabeled videos. SelfEvo introduces a self-distillation scheme based on spatiotemporal context asymmetry, enabling self-improvement for learning-based 4D perception without external annotations. We systematically study the design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Across eight benchmarks spanning diverse datasets and domains, SelfEvo consistently improves pretrained baselines and generalizes across base models (e.g., VGGT and $\pi^3$), with significant gains on dynamic scenes. Overall, SelfEvo achieves up to 36.5% relative improvement in video depth estimation and up to 20.1% in camera estimation, without using any labeled data. Project page: https://self-evo.github.io/.
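The abstract does not spell out the training loop, so the following is only a toy sketch of one plausible reading of "self-distillation using spatiotemporal context asymmetry": the same model is run once on a full clip (teacher role) and once on a subset of its frames (student role), and the teacher's outputs on the shared frames serve as pseudo-labels. The function names, the scalar `toy_model`, the L1 loss, and the frame counts are all hypothetical stand-ins, not the SelfEvo implementation.

```python
import random

def asymmetric_views(num_frames, student_keep, rng):
    """Teacher sees every frame; the student sees a random, temporally
    ordered subset. This is the (assumed) source of context asymmetry."""
    teacher_idx = list(range(num_frames))
    student_idx = sorted(rng.sample(teacher_idx, student_keep))
    return teacher_idx, student_idx

def pseudo_label_loss(student_out, teacher_out):
    """Mean absolute difference on the frames both branches predicted.
    Teacher outputs are treated as fixed pseudo-labels."""
    shared = student_out.keys() & teacher_out.keys()
    return sum(abs(student_out[i] - teacher_out[i]) for i in shared) / len(shared)

def toy_model(frames, visible):
    """Hypothetical stand-in for a pretrained multi-view model: each
    per-frame prediction depends on the frame and on the visible context."""
    context = sum(frames[i] for i in visible) / len(visible)
    return {i: frames[i] + 0.5 * context for i in visible}

rng = random.Random(0)
clip = [rng.random() for _ in range(8)]            # an unlabeled 8-frame "video"
teacher_idx, student_idx = asymmetric_views(len(clip), 3, rng)

teacher_out = toy_model(clip, teacher_idx)         # rich context
student_out = toy_model(clip, student_idx)         # reduced context
loss = pseudo_label_loss(student_out, teacher_out)
```

In a real setup the loss would presumably be backpropagated through the student branch only, with no ground-truth 3D/4D labels involved at any point; how the teacher is maintained or updated is not stated in the abstract.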