We present modal aphasia, a systematic dissociation in which current unified multimodal models accurately memorize concepts visually yet fail to articulate them in writing, despite being trained on images and text simultaneously. For instance, we show that leading frontier models can generate near-perfect reproductions of iconic movie artwork but confuse crucial details when asked for textual descriptions. We corroborate these findings through controlled experiments on synthetic datasets across multiple architectures. Our experiments confirm that modal aphasia reliably emerges as a fundamental property of current unified multimodal models, not merely a training artifact. In practice, modal aphasia can introduce vulnerabilities in AI safety frameworks: safeguards applied to one modality may leave harmful concepts accessible in other modalities. We demonstrate this risk by showing that a model aligned solely on text remains capable of generating unsafe images.