A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large-scale vision and language pretraining, recent investigations find that most, if not all, state-of-the-art vision-language models struggle with compositionality. They are unable to distinguish between images of "a girl in white facing a man in black" and "a girl in black facing a man in white". Moreover, prior work suggests that compositionality does not arise with scale: larger model sizes and larger training sets do not help. This paper develops a new iterated training algorithm that incentivizes compositionality. We draw on decades of cognitive science research that identifies cultural transmission, the need to teach a new generation, as a necessary inductive prior that incentivizes humans to develop compositional languages. Specifically, we reframe vision-language contrastive learning as the Lewis Signaling Game between a vision agent and a language agent, and operationalize cultural transmission by iteratively resetting one of the agents' weights during training. After every iteration, this training paradigm induces representations that become "easier to learn", a property of compositional languages: e.g., our models trained on CC3M and CC12M improve standard CLIP by 4.7% and 4.0%, respectively, on the SugarCrepe benchmark.
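The following is a minimal sketch, not the authors' released code, of how the described training paradigm could be operationalized in a CLIP-style setup: standard contrastive training, interleaved with periodically re-initializing the weights of one agent (here, the text encoder) so that the surviving agent's representations must remain "easy to learn" for a freshly initialized partner. The names `VisionEncoder`-style constructors, `clip_loss`, `make_text_encoder`, and the reset schedule are illustrative assumptions, not identifiers from the paper.

```python
# Hedged sketch of iterated contrastive training with "cultural transmission":
# every iteration, one agent (the text encoder below) is re-initialized,
# while the other agent keeps its weights. All names are illustrative.

import torch
import torch.nn.functional as F


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


def train_with_iterated_resets(vision_encoder, make_text_encoder, loader,
                               num_iterations=5, steps_per_iteration=10_000, lr=1e-4):
    """Alternate contrastive training with re-initialization of the language agent."""
    text_encoder = make_text_encoder()  # fresh "new generation" language agent
    for iteration in range(num_iterations):
        optimizer = torch.optim.AdamW(
            list(vision_encoder.parameters()) + list(text_encoder.parameters()), lr=lr
        )
        for _, (images, texts) in zip(range(steps_per_iteration), loader):
            loss = clip_loss(vision_encoder(images), text_encoder(texts))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Cultural transmission step: reset one agent so the other agent's
        # representations must be re-learnable by an untrained partner.
        text_encoder = make_text_encoder()
    return vision_encoder, text_encoder
```

In this sketch only the text encoder is reset for simplicity; which agent is reset, and how often, is a design choice of the actual algorithm that the abstract does not fully specify.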