Large Language Models (LLMs) handle physical commonsense information inadequately. Because they are trained in a disembodied setting, LLMs often fail to predict an action's outcome in a given environment. However, predicting the effects of an action before it is executed is crucial in planning, where coherent sequences of actions are often needed to achieve a goal. Therefore, we introduce the multi-modal task of predicting the outcomes of actions solely from realistic sensory inputs (images and text). Next, we extend an LLM to model latent representations of objects to better predict action outcomes in an environment. We show that multi-modal models can capture physical commonsense when augmented with visual information. Finally, we evaluate our model's performance on novel actions and objects and find that combining modalities helps models generalize and learn physical commonsense reasoning better.