Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. Since the naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Code is available at https://github.com/Masakichi210/PowerCLIP.
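The drop from O(2^M) to O(M) can look surprising, so a toy illustration may help. The abstract does not specify the form of the non-linear aggregators, so the sketch below is not PowerCLIP's method. It only assumes a hypothetical setting in which each region i has a scalar score s_i, a subset's score is the sum of its members' scores, and subsets are pooled with log-sum-exp. Under those assumptions, the sum over all 2^M subsets factorizes into a product over regions, so the pooled value can be computed in linear time. The function names are illustrative and do not come from the PowerCLIP repository.

```python
import itertools
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def subset_logsumexp_bruteforce(scores):
    """log of the sum over every subset S of exp(sum of scores in S).

    Enumerates all 2^M subsets (including the empty one), so the cost is
    exponential in the number of regions M.
    """
    m = len(scores)
    total = 0.0
    for size in range(m + 1):
        for subset in itertools.combinations(range(m), size):
            total += math.exp(sum(scores[i] for i in subset))
    return math.log(total)


def subset_logsumexp_linear(scores):
    """Same quantity in O(M), using the identity
    sum_S prod_{i in S} exp(s_i) = prod_i (1 + exp(s_i)),
    whose logarithm is a sum of softplus terms.
    """
    return sum(softplus(s) for s in scores)


if __name__ == "__main__":
    # Hypothetical region scores, chosen only for illustration.
    region_scores = [0.3, -1.2, 2.0, 0.5]
    print(subset_logsumexp_bruteforce(region_scores))
    print(subset_logsumexp_linear(region_scores))
```

In this toy case the linear form is exact, whereas the abstract describes the NLAs as approximating the exact loss to arbitrary precision, which suggests the actual objective is more general than this factorizable example.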