MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

Human beings can integrate a melange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal large language models, however, passively absorb sensory data as inputs; they lack the capacity to actively interact with objects in a 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied large language model that can incorporate multisensory interactive data, including visual, audio, tactile, and thermal information, into large language models, thereby establishing correlations among words, actions, and percepts. To this end, we first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data samples, by deploying an LLM-powered embodied agent to engage with the 3D environment. To perform instruction tuning with a pre-trained LLM on this generated data, we first encode the 3D scene as abstracted object-centric representations. We then introduce action tokens, which denote that the embodied agent takes certain actions within the environment, and state tokens, which represent the agent's multisensory state observations at each time step. At inference time, MultiPLY can generate action tokens that instruct the agent to take an action in the environment and obtain the next multisensory state observation. The observation is then appended back to the LLM via state tokens to generate subsequent text or action tokens. We demonstrate that MultiPLY outperforms baselines by a large margin on a diverse set of embodied tasks involving object retrieval, tool use, multisensory captioning, and task decomposition.
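
The inference-time loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the token names (`<TOUCH>`, `<STATE>`, `<EOS>`), the function `run_episode`, and the scripted stand-ins `toy_model` and `toy_env` are all hypothetical, and observations are plain strings here, whereas a real system would presumably feed encoded sensory features between the state tokens.

```python
from typing import Callable, List

# Hypothetical vocabulary; the abstract does not specify the actual tokens.
ACTION_TOKENS = {"<TOUCH>", "<TAP>", "<LOOK>"}
EOS = "<EOS>"


def run_episode(
    next_token: Callable[[List[str]], str],
    execute_action: Callable[[str], str],
    prompt: List[str],
    max_steps: int = 32,
) -> List[str]:
    """Alternate between generating tokens and acting in the environment."""
    context = list(prompt)
    for _ in range(max_steps):
        token = next_token(context)
        context.append(token)
        if token == EOS:
            break
        if token in ACTION_TOKENS:
            # An action token triggers an interaction; the resulting
            # observation is appended to the context between state tokens.
            observation = execute_action(token)
            context.extend(["<STATE>", observation, "</STATE>"])
    return context


def toy_model(context: List[str]) -> str:
    """Scripted stand-in for the LLM (assumption, for illustration only)."""
    if "<TOUCH>" not in context:
        return "<TOUCH>"  # first, decide to touch the object
    if context[-1] == "</STATE>":
        # answer based on the observation that was just appended
        return "hot" if "temperature=high" in context else "cold"
    return EOS


def toy_env(action: str) -> str:
    """Scripted stand-in for the 3D environment (assumption)."""
    return {"<TOUCH>": "temperature=high"}.get(action, "nothing")


if __name__ == "__main__":
    print(run_episode(toy_model, toy_env, ["Is", "the", "mug", "hot?"]))
```

The point of the sketch is the control flow: generation pauses whenever an action token is emitted, the environment is queried, and the new observation becomes part of the context that conditions the next token.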
