Large Language Models (LLMs) handle physical commonsense information inadequately. Because they are trained in a disembodied setting, LLMs often fail to predict an action's outcome in a given environment. However, predicting the effects of an action before it is executed is crucial in planning, where coherent sequences of actions are often needed to achieve a goal. Therefore, we introduce the multi-modal task of predicting the outcomes of actions solely from realistic sensory inputs (images and text). Next, we extend an LLM to model latent representations of objects to better predict action outcomes in an environment. We show that multi-modal models can capture physical commonsense when augmented with visual information. Finally, we evaluate our model's performance on novel actions and objects and find that combining modalities helps models generalize and learn physical commonsense reasoning better.