Current methods for training and benchmarking vision models rely heavily on passive, curated datasets. Although models trained on these datasets perform strongly on a wide variety of tasks such as classification, detection, and segmentation, they are fundamentally unable to generalize to an ever-evolving world because the input data constantly shifts out of distribution. Therefore, instead of training on fixed datasets, can we approach learning in a more human-centric and adaptive manner? In this paper, we introduce \textbf{A}ction-aware Embodied \textbf{L}earning for \textbf{P}erception (ALP), an embodied learning framework that incorporates action information into representation learning by combining two objectives: a reinforcement learning policy optimized with policy gradients, and inverse dynamics prediction. Our method actively explores complex 3D environments both to learn generalizable, task-agnostic representations and to collect downstream training data. We show that ALP outperforms existing baselines in object detection and semantic segmentation. In addition, we show that by training on actively collected data that is more relevant to the environment and task, our method generalizes more robustly to downstream tasks than models pre-trained on fixed datasets such as ImageNet.
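The combination of a policy-gradient objective with an inverse dynamics prediction objective can be illustrated with a minimal sketch. The abstract does not specify the exact loss, so everything below is an assumption made for illustration: the REINFORCE-style surrogate, the cross-entropy form of the inverse dynamics term, the weighting parameter `inv_weight`, and all function names are hypothetical. In a real system both sets of logits would come from heads on a shared visual encoder; here they are passed in directly.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def policy_gradient_loss(policy_logits, action, advantage):
    """REINFORCE-style surrogate (assumed form): -advantage * log pi(action | obs)."""
    return -advantage * log_softmax(policy_logits)[action]


def inverse_dynamics_loss(inv_logits, action):
    """Cross-entropy (assumed form) for predicting the action taken between
    two consecutive observations, given logits over the discrete actions."""
    return -log_softmax(inv_logits)[action]


def combined_loss(policy_logits, inv_logits, action, advantage, inv_weight=1.0):
    """Weighted sum of the two terms; inv_weight is a hypothetical hyperparameter."""
    return (policy_gradient_loss(policy_logits, action, advantage)
            + inv_weight * inverse_dynamics_loss(inv_logits, action))


if __name__ == "__main__":
    # Four discrete actions, uniform logits from both heads.
    loss = combined_loss([0.0] * 4, [0.0] * 4, action=1, advantage=2.0, inv_weight=0.5)
    print(loss)
```

Because both terms depend on the same action, gradients from either one would flow into a shared encoder, which is one way action information could shape the learned visual representation.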