Decision making via sequence modeling aims to mimic the success of language models by treating the actions taken by an embodied agent as tokens to predict. Despite the promising performance of this approach, it remains unclear whether embodied sequence modeling leads to the emergence of internal representations that encode environmental state information. A model that lacks abstract state representations would be liable to make decisions based on surface statistics that fail to generalize. We take the BabyAI environment, a grid world in which language-conditioned navigation tasks are performed, and build a sequence modeling Transformer that takes a language instruction, a sequence of actions, and environmental observations as its inputs. To investigate the emergence of abstract state representations, we design a "blindfolded" navigation task, in which only the initial environmental layout, the language instruction, and the action sequence that completes the task are available during training. Our probing results show that intermediate environmental layouts can be reasonably reconstructed from the internal activations of a trained model, and that language instructions play a role in reconstruction accuracy. Our results suggest that many key features of state representations can emerge via embodied sequence modeling, supporting an optimistic outlook for applying sequence modeling objectives to more complex embodied decision-making domains.
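The probing idea described above can be illustrated with a minimal sketch. It is not the authors' probe: the activations are synthetic stand-ins rather than Transformer hidden states, the probe is a plain softmax regression, and every name (`train_linear_probe`, `probe_accuracy`, the three hypothetical cell classes) is an assumption made for illustration. The sketch fits a probe that predicts the content of one grid cell from an activation vector, then compares it against a control probe trained on shuffled labels.

```python
import numpy as np

def train_linear_probe(acts, labels, n_classes, lr=0.1, epochs=200):
    """Fit a softmax-regression probe with full-batch gradient descent."""
    n, d = acts.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = acts @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (acts.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(W, b, acts, labels):
    """Fraction of samples whose predicted class matches the label."""
    preds = (acts @ W + b).argmax(axis=1)
    return float((preds == labels).mean())

# Synthetic stand-in data (assumption): each hypothetical cell class
# (e.g. empty / wall / object) shifts the activation along its own direction.
rng = np.random.default_rng(0)
n_classes, hidden_dim, n_samples = 3, 32, 600
labels = rng.integers(0, n_classes, size=n_samples)
directions = rng.normal(size=(n_classes, hidden_dim))
acts = directions[labels] + 0.5 * rng.normal(size=(n_samples, hidden_dim))

train, test = slice(0, 450), slice(450, None)

# Probe trained on the true labels.
W, b = train_linear_probe(acts[train], labels[train], n_classes)
accuracy = probe_accuracy(W, b, acts[test], labels[test])

# Control: a probe trained on shuffled labels should do much worse.
shuffled = rng.permutation(labels[train])
Wc, bc = train_linear_probe(acts[train], shuffled, n_classes)
control_accuracy = probe_accuracy(Wc, bc, acts[test], labels[test])

print(f"probe accuracy:   {accuracy:.2f}")
print(f"control accuracy: {control_accuracy:.2f}")
```

In this toy setup, a large gap between the probe and its shuffled-label control is what indicates that the activations carry the probed information. With a real model, the activations would come from a chosen layer and one probe would be fit per grid cell or per state feature.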