Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models

Vision-Language-Action (VLA) models have recently shown promise in learning generalist robot policies from large-scale multimodal data. However, most existing VLA systems are trained and evaluated primarily with English instructions, leaving their ability to understand and execute instructions in other languages largely unexplored. Although the underlying large language models are often multilingual, it remains unclear whether that capability survives VLA training. In this work, we present the first systematic study of multilingual instruction following in VLA models. We first construct multilingual instructions by translating the instructions of existing benchmarks. Using these instructions, we evaluate several representative VLA models across a range of simulated tasks. Our experiments reveal a significant multilingual gap: models trained primarily on English instructions degrade substantially when evaluated in other languages, even when the underlying language backbone is multilingual. We provide several findings and analyses to explain this gap. An analysis of cross-lingual transfer behavior shows that performance drops correlate with both instruction understanding and action execution. Representation analyses suggest that representation shifts induced by non-English instructions may contribute to the gap. Motivated by these findings, we further explore strategies for improving multilingual performance in VLAs. We propose a simple yet effective multilingual fine-tuning approach, Multilingual Principal Component Alignment, which uses Principal Component Analysis (PCA) to obtain a principal component subspace and aligns multilingual representations projected onto that subspace, effectively reducing the multilingual performance gap.
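
The abstract gives no implementation details for Multilingual Principal Component Alignment, so the following is only a minimal sketch of the general idea (fit a PCA subspace, project paired representations, penalize their distance), not the authors' method. It assumes that the subspace is fitted on English instruction representations, that each non-English representation is paired with its English counterpart, and that the alignment objective is a mean squared distance. The function names `pca_subspace` and `alignment_loss` and the toy data are all hypothetical.

```python
import numpy as np


def pca_subspace(reps, k):
    """Return the mean (d,) and top-k principal directions (d, k) of reps (n, d)."""
    mean = reps.mean(axis=0)
    centered = reps - mean
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k].T


def alignment_loss(en_reps, other_reps, mean, basis):
    """Mean squared distance between paired representations after projection.

    Assumption: row i of other_reps is the translation of row i of en_reps.
    """
    en_proj = (en_reps - mean) @ basis
    other_proj = (other_reps - mean) @ basis
    return float(np.mean(np.sum((en_proj - other_proj) ** 2, axis=1)))


# Toy data: 64 "English" representations of dimension 16, and a copy shifted
# by a constant offset to mimic a language-induced representation shift.
rng = np.random.default_rng(0)
en = rng.normal(size=(64, 16))
shift = rng.normal(size=(1, 16))
other = en + shift

mean, basis = pca_subspace(en, k=4)
loss_shifted = alignment_loss(en, other, mean, basis)
loss_same = alignment_loss(en, en, mean, basis)
```

In a fine-tuning setup, a term like `alignment_loss` would presumably be added to the policy's training objective and computed with a differentiable framework rather than NumPy. Where the representations are taken from and how the term is weighted are not stated in the abstract.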
