Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, capturing compositional semantics that span multiple image regions remains challenging. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing a loss defined between the powersets of image regions and of textual parse trees. Since naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce the complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss to arbitrary precision. Extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods on zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Our code will be made publicly available.
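The abstract does not specify the form of the non-linear aggregators, so the following is only a minimal toy sketch of the general principle behind the O(2^M) to O(M) reduction: when a subset-level score factorizes over its elements, the sum over all 2^M region subsets collapses into a product of M per-region terms. The per-region scores `x` and the factorized scoring rule here are hypothetical illustrations, not the paper's actual NLA or loss.

```python
import itertools
import numpy as np

# Toy identity: sum over all subsets S of {1..M} of prod_{i in S} x_i
# equals prod_i (1 + x_i), so an exponential enumeration can collapse
# to a linear-time computation when the aggregator decomposes this way.
rng = np.random.default_rng(0)
M = 12                        # number of image regions (small enough to enumerate)
x = rng.uniform(0.0, 1.0, M)  # hypothetical per-region alignment scores

# Naive O(2^M): enumerate every subset of regions explicitly.
naive = sum(
    np.prod(x[list(subset)]) if subset else 1.0
    for r in range(M + 1)
    for subset in itertools.combinations(range(M), r)
)

# O(M): the same quantity via the closed-form factorization.
fast = np.prod(1.0 + x)

assert np.isclose(naive, fast)
print(naive, fast)
```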