Vision-language decision making (VLDM) is a challenging multimodal task. The agent has to understand complex human instructions and complete compositional tasks involving environment navigation and object manipulation. However, the long action sequences involved in VLDM make the task difficult to learn. From an environment perspective, we find that task episodes can be divided into fine-grained \textit{units}, each containing a navigation phase and an interaction phase. Since the environment within a unit stays unchanged, we propose a novel hybrid-training framework that enables active exploration in the environment and reduces exposure bias. This framework leverages the unit-grained configurations and is model-agnostic. Specifically, we design a Unit-Transformer (UT) with an intrinsic recurrent state that maintains a unit-scale cross-modal memory. Through extensive experiments on the TEACH benchmark, we demonstrate that the proposed framework outperforms existing state-of-the-art methods on all evaluation metrics. Overall, our work tackles the VLDM task by breaking it down into smaller, manageable units and training with a hybrid framework, providing a more flexible and effective solution for multimodal decision making.