Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, driving rapid advances in multimodal research. However, CLIP suffers from inefficient data utilization: it relies on a single contrastive supervision signal for each image-text pair during representation learning, disregarding a substantial amount of valuable information that could provide richer supervision. Additionally, retaining non-informative tokens increases computational cost and training time, particularly in CLIP's ViT image encoder. To address these issues, we propose Multi-Perspective Language-Image Pretraining (MLIP). In MLIP, we exploit the frequency transform's sensitivity to both high- and low-frequency variations, which complements the spatial domain's sensitivity to low-frequency variations alone. By incorporating frequency transforms and token-level alignment, we extend CLIP's single supervision into multi-domain and multi-level supervision, enabling a more thorough exploration of informative image features. In addition, we introduce a token merging method guided by comprehensive semantics from the frequency and spatial domains, which merges tokens into multi-granularity tokens at a controllable compression rate to accelerate CLIP. Extensive experiments validate the effectiveness of our design.
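The two ideas above can be sketched together: a frequency-domain saliency score per image patch, and score-guided token merging at a controllable compression rate. This is a minimal illustrative sketch under our own assumptions (FFT power outside the DC term as the frequency score, greedy cosine-similarity pooling as the merge rule), not the paper's actual algorithm:

```python
import numpy as np

def high_freq_score(patches: np.ndarray) -> np.ndarray:
    """Fraction of each patch's spectral power outside the DC term.

    An illustrative frequency-domain saliency proxy (our assumption, not the
    paper's exact measure). patches: (N, H, W) grayscale -> (N,) scores in [0, 1].
    """
    power = np.abs(np.fft.fft2(patches)) ** 2
    total = power.sum(axis=(1, 2))
    dc = power[:, 0, 0]                      # lowest-frequency (DC) component
    return (total - dc) / np.maximum(total, 1e-12)

def merge_tokens(tokens: np.ndarray, scores: np.ndarray, rate: float) -> np.ndarray:
    """Merge away a `rate` fraction of the lowest-scoring tokens.

    Each dropped token is averaged into its most cosine-similar kept token,
    so uninformative regions end up represented by fewer, coarser tokens.
    tokens: (N, D), scores: (N,); returns (N - round(N * rate), D).
    """
    n = tokens.shape[0]
    n_drop = int(round(n * rate))
    order = np.argsort(scores)               # ascending importance
    drop, keep = order[:n_drop], order[n_drop:]
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    merged = tokens[keep].copy()
    counts = np.ones(len(keep))
    if n_drop:
        nearest = (normed[drop] @ normed[keep].T).argmax(axis=1)
        for d, t in zip(drop, nearest):      # pool each dropped token into its match
            merged[t] += tokens[d]
            counts[t] += 1
    merged /= counts[:, None]                # mean over all tokens pooled per slot
    return merged
```

With `rate=0.5`, for example, a 196-token ViT patch sequence would shrink to 98 tokens; `rate` is the controllable compression knob, and a flat patch scores near 0 while a high-frequency (e.g. checkerboard) patch scores near 1.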