Multi-modal generative AI systems, such as those combining vision and language, rely on contrastive pre-training to learn representations across different modalities. While their practical benefits are widely acknowledged, a rigorous theoretical understanding of the contrastive pre-training framework remains limited. This paper develops a theoretical framework to explain the success of contrastive pre-training in downstream applications such as zero-shot classification, conditional diffusion models, and vision-language models. We introduce the concept of approximate sufficient statistics, a generalization of classical sufficient statistics, and show that near-minimizers of the contrastive pre-training loss are approximately sufficient, making them adaptable to diverse downstream tasks. We further propose the Joint Generative Hierarchical Model for the joint distribution of images and text, showing that transformers can efficiently approximate relevant functions within this model via belief propagation. Building on this framework, we derive sample complexity guarantees for multi-modal learning based on contrastively pre-trained representations. Numerical simulations validate these theoretical findings, demonstrating the strong generalization performance of contrastively pre-trained transformers on various multi-modal tasks.
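For concreteness, the contrastive pre-training objective referenced above is typically an InfoNCE-style loss over paired image-text embeddings. The following is a standard symmetric formulation (a sketch for illustration; the encoders $f_\theta$, $g_\phi$, temperature $\tau$, and batch size $n$ are generic notation and not necessarily the paper's exact loss):

% Symmetric InfoNCE-style contrastive loss over a batch of n image-text pairs
% (x_i, y_i); f_theta and g_phi denote the image and text encoders, and tau > 0
% is a temperature parameter (illustrative notation, not the paper's own).
\[
\mathcal{L}(\theta,\phi)
= -\frac{1}{2n}\sum_{i=1}^{n}\left[
    \log \frac{\exp\!\big(\langle f_\theta(x_i), g_\phi(y_i)\rangle/\tau\big)}
              {\sum_{j=1}^{n}\exp\!\big(\langle f_\theta(x_i), g_\phi(y_j)\rangle/\tau\big)}
  + \log \frac{\exp\!\big(\langle f_\theta(x_i), g_\phi(y_i)\rangle/\tau\big)}
              {\sum_{j=1}^{n}\exp\!\big(\langle f_\theta(x_j), g_\phi(y_i)\rangle/\tau\big)}
  \right]
\]

Under this reading, a near-minimizer $(f_\theta, g_\phi)$ of $\mathcal{L}$ is the object shown to be an approximate sufficient statistic for the image-text joint distribution.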