Pre-trained vision-language models lack good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that generalize to new contexts. Drawing on research in cognitive science, we hypothesize that models must interact with an environment to properly learn its physical dynamics. We therefore train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even when the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.