Vision-Language Pre-training (VLP) models such as CLIP have achieved remarkable success in computer vision and, in particular, have demonstrated superior robustness to distribution shifts in 2D images. However, their robustness to 3D viewpoint variations remains limited, which can hinder their deployment in real-world applications. This paper addresses this limitation while preserving the original performance of VLP models by tackling two primary obstacles: 1) the scarcity of training data and 2) suboptimal fine-tuning paradigms. To combat data scarcity, we build the Multi-View Caption (MVCap) dataset, a comprehensive collection of over four million multi-view image-text pairs covering more than 100K objects, which gives VLP models greater potential to learn generalizable viewpoint-invariant representations. To address the limitations of existing paradigms in performance trade-offs and training efficiency, we design a novel fine-tuning framework named Omniview-Tuning (OVT). Specifically, OVT introduces a Cross-Viewpoint Alignment objective through a minimax-like optimization strategy, which effectively aligns representations of the same object across diverse viewpoints without causing overfitting. Additionally, OVT fine-tunes VLP models in a parameter-efficient manner, incurring minimal computational cost. Extensive experiments on VLP models with different architectures validate that OVT significantly improves resilience to viewpoint shifts while maintaining the original performance, establishing a pioneering standard for boosting the viewpoint invariance of VLP models.
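The abstract does not spell out the Cross-Viewpoint Alignment objective, so the following is only a minimal sketch of what a minimax-like alignment step could look like, not the authors' implementation. It assumes that the inner "max" step selects the viewpoints of an object whose embeddings lie farthest from that object's mean embedding, and that the outer "min" step would then reduce the returned loss by gradient descent. The function names, the use of the mean embedding as an anchor, the cosine distance, and the `top_k` parameter are all hypothetical choices made for illustration. Plain Python lists stand in for the tensors a real training loop would use.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def worst_case_alignment_loss(view_embeddings, top_k=1):
    """Hypothetical inner 'max' step for one object.

    Picks the top_k viewpoints (top_k >= 1) whose embeddings are farthest
    from the object's mean embedding and returns their average distance
    together with their indices. An outer optimizer would minimize this value.
    """
    anchor = mean_vector(view_embeddings)
    distances = [cosine_distance(e, anchor) for e in view_embeddings]
    # Sort viewpoint indices from most to least misaligned, keep the worst ones.
    worst = sorted(range(len(distances)),
                   key=lambda i: distances[i], reverse=True)[:top_k]
    loss = sum(distances[i] for i in worst) / len(worst)
    return loss, worst


# Toy example: three 2-D "embeddings" of the same object from different views.
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
loss, worst = worst_case_alignment_loss(views, top_k=1)
# The third view (index 2) is the outlier, so worst == [2].
```

Concentrating the loss on the hardest viewpoints, rather than averaging over all of them, is one plausible reading of how such an objective could align outlier views without over-constraining views that are already well aligned.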