Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. NOVA aligns visual representations to a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, while enforcing an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This eliminates the need for negative sampling, momentum encoders, or stop-gradients, leaving a training objective with a single hyperparameter. We evaluate NOVA on zero-shot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR. On zero-shot classification across three benchmark datasets, NOVA outperforms multiple standard baselines while exhibiting substantially lower variance across training runs. Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.
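For concreteness, the sketch below shows one plausible form of the objective described above, under stated assumptions: a mean-squared prediction loss between image embeddings and frozen text embeddings, plus a simplified SIGReg surrogate that matches the first two moments of random 1-D projections to a standard Gaussian. The actual SIGReg uses a sketched goodness-of-fit statistic rather than plain moment matching, and the names `nova_step`, `sigreg_loss`, and the weight `lam` are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def sigreg_loss(z, num_directions=64):
    """Simplified SIGReg surrogate (assumption, not the paper's exact form):
    project embeddings onto random unit directions and penalize deviation of
    each 1-D projection's mean and variance from those of a standard Gaussian."""
    d = z.shape[-1]
    dirs = F.normalize(torch.randn(d, num_directions, device=z.device), dim=0)
    proj = z @ dirs                        # (batch, num_directions)
    mean_pen = proj.mean(dim=0).pow(2).mean()
    var_pen = (proj.var(dim=0) - 1.0).pow(2).mean()
    return mean_pen + var_pen

def nova_step(image_encoder, text_encoder, images_aug, texts, lam=1.0):
    """One hypothetical NOVA training step: predict the frozen text encoder's
    embeddings from an augmented image view, regularized by SIGReg on the
    image embeddings. `lam` is the single trade-off hyperparameter; both
    encoders are assumed to output embeddings of the same dimension."""
    with torch.no_grad():                  # text encoder stays frozen
        t = text_encoder(texts)
    v = image_encoder(images_aug)
    pred_loss = F.mse_loss(v, t)           # joint-embedding prediction loss
    return pred_loss + lam * sigreg_loss(v)
```

Note that the regularizer only touches the current batch of image embeddings, so no negative sampling, momentum encoder, or stop-gradient (beyond freezing the text encoder) is needed, consistent with the single-hyperparameter claim.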