Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that the use of Rotary Position Embeddings (RoPE) raises the data-complexity threshold at which ICL emerges. Extending to the multimodal setting reveals a fundamental learning asymmetry: when a model is pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation.
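The induction-style mechanism described above can be reduced to a functional sketch: given in-context exemplar pairs and a query, the circuit attends to the exemplar that matches the query and copies its label. The snippet below is a minimal illustration, not the paper's code; "matching" is implemented as nearest-neighbour in a toy feature space, and all names, dimensions, and data are illustrative assumptions.

```python
import numpy as np

def induction_copy(context_x, context_y, query):
    """Copy the label of the in-context exemplar closest to the query.

    This mimics, in functional form, what an induction-style circuit
    computes: an attention-like score over context positions, followed
    by copying the label at the highest-scoring match.
    """
    context_x = np.asarray(context_x, dtype=float)
    query = np.asarray(query, dtype=float)
    # Attention-like score: negative squared distance to each exemplar.
    scores = -((context_x - query) ** 2).sum(axis=1)
    return context_y[int(np.argmax(scores))]

# Toy context of two labelled exemplars and a query near the first one.
ctx_x = [[1.0, 0.0], [0.0, 1.0]]
ctx_y = ["A", "B"]
print(induction_copy(ctx_x, ctx_y, [0.9, 0.1]))  # -> "A"
```

In the multimodal setting studied in the paper, one can think of the two modalities as different input subspaces feeding the same copy mechanism; the reported asymmetry concerns how much data diversity each modality needs before such a circuit forms.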