Understanding multimodal perception for embodied AI (EAI) remains an open question because the input modalities may carry information that is highly complementary for the task, redundant, or both. A relevant direction for multimodal policies is to understand the global trends of each modality at the fusion layer. To this end, we disentangle the attributions for visual, language, and previous-action inputs across different policies trained on the ALFRED dataset. Attribution analysis can be used to rank and group failure scenarios, investigate modeling and dataset biases, and critically analyze multimodal EAI policies for robustness and user trust before deployment. We present MAEA, a framework for computing global attributions per modality for any differentiable policy. In addition, we show how language and visual attributions enable lower-level behavior analysis of EAI policies.
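To make the idea of per-modality global attributions at the fusion layer concrete, the following is a minimal sketch and not the MAEA implementation. It assumes gradient-times-input as the attribution method, a fused vector formed by concatenating visual, language, and previous-action features, and a toy policy head. The names `modality_attributions`, `global_attributions`, `head`, and `slices` are hypothetical, and a finite-difference gradient stands in for automatic differentiation so that the example needs only NumPy.

```python
import numpy as np

def numerical_grad(f, z, eps=1e-5):
    """Central-difference gradient of a scalar function f at z.
    A stand-in for autodiff; any differentiable policy would work."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (f(z + step) - f(z - step)) / (2 * eps)
    return grad

def modality_attributions(head, z, slices):
    """Share of |gradient * input| mass per modality for the predicted action."""
    action = int(np.argmax(head(z)))
    grad = numerical_grad(lambda v: head(v)[action], z)
    contrib = np.abs(grad * z)
    totals = {name: contrib[s].sum() for name, s in slices.items()}
    norm = sum(totals.values()) + 1e-12  # guard against division by zero
    return {name: t / norm for name, t in totals.items()}

def global_attributions(head, fused_batch, slices):
    """Average the per-sample modality shares over a batch of fused vectors."""
    per_sample = [modality_attributions(head, z, slices) for z in fused_batch]
    return {name: float(np.mean([p[name] for p in per_sample])) for name in slices}

# Toy setup (all values are illustrative assumptions, not ALFRED data).
rng = np.random.default_rng(0)
slices = {
    "visual": slice(0, 4),
    "language": slice(4, 8),
    "prev_action": slice(8, 10),
}
W = rng.normal(size=(3, 10))          # hypothetical head over 3 actions
head = lambda z: np.tanh(W @ z)       # hypothetical differentiable policy head
fused_batch = rng.normal(size=(16, 10))

result = global_attributions(head, fused_batch, slices)
print(result)
```

Each per-sample share is normalized to sum to one across modalities, so the averaged values can be read as the relative weight the policy places on each input stream. A policy dominated by one modality would show up as a share close to one for that key.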