To succeed, Vision-and-Language Navigation (VLN) agents must ground instructions to actions based on their surroundings. In this work, we develop a methodology to study agent behavior on a skill-specific basis, examining how well existing agents ground instructions about stopping, turning, and moving towards specified objects or rooms. Our approach generates skill-specific interventions and measures the resulting changes in agent predictions. We present a detailed case study analyzing the behavior of a recent agent and then compare multiple agents in terms of skill-specific competency scores. This analysis suggests that biases from training have lasting effects on agent behavior and that existing models can ground simple referring expressions. Our comparisons between models show that skill-specific scores correlate with improvements in overall VLN task performance.
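The intervention idea in the abstract (edit an instruction so that it targets one skill, then measure how the agent's predictions change) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `toy_policy` is a hypothetical keyword-based stand-in for a real VLN agent, the action set is invented, and `intervention_effect` is just one plausible way to quantify a change in predictions.

```python
import math
import re
from typing import Callable, Dict

# Hypothetical discrete action space; a real VLN agent's action space differs.
ACTIONS = ["stop", "turn_left", "turn_right", "forward"]

Policy = Callable[[str], Dict[str, float]]


def toy_policy(instruction: str) -> Dict[str, float]:
    """Hypothetical stand-in for an agent: keyword scores passed through a softmax."""
    words = set(re.findall(r"[a-z]+", instruction.lower()))
    scores = {action: 0.0 for action in ACTIONS}
    if "left" in words:
        scores["turn_left"] += 2.0
    if "right" in words:
        scores["turn_right"] += 2.0
    if "stop" in words:
        scores["stop"] += 2.0
    total = sum(math.exp(s) for s in scores.values())
    return {action: math.exp(s) / total for action, s in scores.items()}


def swap_turn_direction(instruction: str) -> str:
    """Turning-skill intervention (assumed form): swap 'left' and 'right'."""
    swap = {"left": "right", "right": "left"}
    return re.sub(r"\b(left|right)\b", lambda m: swap[m.group(1)], instruction)


def intervention_effect(policy: Policy, original: str, intervened: str, action: str) -> float:
    """Change in the probability assigned to `action` after the intervention."""
    return policy(intervened)[action] - policy(original)[action]


if __name__ == "__main__":
    original = "Turn left at the sofa."
    intervened = swap_turn_direction(original)
    effect = intervention_effect(toy_policy, original, intervened, "turn_right")
    print(f"{original!r} -> {intervened!r}: change in P(turn_right) = {effect:.3f}")
```

Under these assumptions, an agent that grounds turning instructions should show a large positive shift toward `turn_right` after the swap, while an agent that ignores the direction word would show a shift near zero.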