We introduce the "single-life" learning paradigm, in which a distinct vision model is trained exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets, each capturing a different life both indoors and outdoors, and by introducing a novel cross-attention-based metric that quantifies the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that transfer effectively to downstream tasks, such as depth estimation, in unseen environments. Third, we show that training on up to 30 hours of video from one week of a single person's life yields performance comparable to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both induces consistency across models trained on individual lives and provides a powerful signal for visual representation learning.
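The cross-attention-based alignment metric is only named here, not specified. The snippet below is a minimal illustrative sketch of one way such a score could be computed, under assumptions not stated in the abstract: both encoders are frozen ViTs emitting patch tokens of the same dimension for the same image, and alignment is scored by how much attention mass corresponding tokens place on each other. The function name `cross_attention_alignment` and the `temperature` parameter are hypothetical.

```python
# Hypothetical sketch of a cross-attention-style alignment score between two
# independently trained encoders; not the paper's actual metric. Assumes both
# models emit patch-token features of equal dimension for the same image.
import torch
import torch.nn.functional as F

def cross_attention_alignment(feats_a: torch.Tensor,
                              feats_b: torch.Tensor,
                              temperature: float = 0.07) -> float:
    """feats_a, feats_b: (num_patches, dim) token features from the same image."""
    # L2-normalise so the dot product behaves like a cosine similarity.
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    # Cross-attention: every token of model A attends over all tokens of model B.
    attn = torch.softmax(a @ b.T / temperature, dim=-1)  # (num_patches, num_patches)
    # If the two models carve up the scene consistently, token i of A should place
    # most of its attention mass on token i of B; score the mean diagonal mass.
    return attn.diagonal().mean().item()

# Toy usage with random features (real use would pass frozen ViT patch tokens).
score = cross_attention_alignment(torch.randn(196, 384), torch.randn(196, 384))
print(f"alignment score: {score:.3f}")
```

Because the probe is non-parametric (no weights are trained to compare the two models), a score like this would reflect alignment already present in the representations rather than alignment induced by the probe itself.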