Alignment among Language, Vision and Action Representations

A fundamental question in cognitive science and AI is whether different learning modalities (language, vision, and action) give rise to distinct or shared internal representations. Traditional views assume that models trained on different data types develop specialized, non-transferable representations. However, recent evidence suggests an unexpected convergence: models optimized for distinct tasks may develop similar representational geometries. We investigate whether this convergence extends to embodied action learning by training a transformer-based agent to execute goal-directed behaviors in response to natural language instructions. Using behavioral cloning on the BabyAI platform, we generated action-grounded language embeddings shaped exclusively by sensorimotor control requirements. We then compared these representations with those extracted from state-of-the-art large language models (LLaMA, Qwen, DeepSeek, BERT) and vision-language models (CLIP, BLIP). Despite substantial differences in training data, modality, and objectives, we observed robust cross-modal alignment. Action representations aligned strongly with decoder-only language models and BLIP (precision@15: 0.70-0.73), approaching the alignment observed among language models themselves. Alignment with CLIP and BERT was significantly weaker. These findings indicate that linguistic, visual, and action representations converge toward partially shared semantic structures, supporting a modality-independent semantic organization and highlighting the potential for cross-domain transfer in embodied AI systems.
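
The abstract reports alignment as precision@15 but does not define the metric. As a rough illustration only, the sketch below shows one plausible reading: for each instruction, take its 15 nearest neighbors in each of two embedding spaces and measure how much the two neighbor sets overlap. The function names (`cosine_knn`, `neighbor_precision_at_k`), the use of cosine similarity, and the neighborhood-overlap definition are all assumptions made for this example and may differ from the procedure actually used in the study.

```python
import numpy as np


def cosine_knn(emb: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest rows (by cosine similarity) for each row.

    Hypothetical helper for illustration; each row is one instruction embedding.
    """
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # an item is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]


def neighbor_precision_at_k(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 15) -> float:
    """Mean fraction of each item's k neighbors in space A that are also
    among its k neighbors in space B.

    Assumed definition, not necessarily the metric used in the paper.
    Row i of emb_a and row i of emb_b must describe the same instruction;
    the two spaces may have different dimensionality.
    """
    nn_a = cosine_knn(emb_a, k)
    nn_b = cosine_knn(emb_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))


# Synthetic stand-ins for real embeddings (no data from the study is used here).
rng = np.random.default_rng(0)
space_a = rng.normal(size=(200, 32))           # e.g. action-grounded embeddings
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
space_b = space_a @ rotation                   # same geometry, different axes
space_c = rng.normal(size=(200, 64))           # unrelated space

aligned_score = neighbor_precision_at_k(space_a, space_b, k=15)
unrelated_score = neighbor_precision_at_k(space_a, space_c, k=15)
```

Because this score depends only on neighborhood structure, it is unchanged by rotations of either space and works across spaces of different dimensionality, which is why a measure of this kind suits comparisons between models with different architectures. In the synthetic example, the rotated copy should score at or near 1.0, while the unrelated space should score close to the chance level of roughly k / (n - 1).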
