StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues

Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval in both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at: https://github.com/intelligolabs/StructXLIP.
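
As an illustration of how the standard alignment loss and the three structure-centric losses could be combined, the following is a minimal sketch and not the authors' implementation. It assumes that every term is a CLIP-style symmetric InfoNCE loss (a contrastive objective that lower-bounds mutual information), that embeddings are plain lists of floats, and that local edge regions and text chunks are already paired one-to-one. The function names, the temperature of 0.07, and the loss weights are hypothetical choices made only for this example.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _normalize(v):
    # Assumes a nonzero vector; cosine similarity needs unit-length embeddings.
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]

def _cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def info_nce(xs, ys, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings, where xs[i] matches ys[i]."""
    xs = [_normalize(x) for x in xs]
    ys = [_normalize(y) for y in ys]
    logits = [[_dot(x, y) / temperature for y in ys] for x in xs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))

def combined_loss(img, txt, edge, struct_txt, regions, chunks,
                  weights=(1.0, 1.0, 1.0)):
    """Standard alignment loss plus three structure-centric terms (assumed form)."""
    w_global, w_local, w_consistency = weights  # hypothetical weights
    base = info_nce(img, txt)                   # standard image-caption alignment
    l_global = info_nce(edge, struct_txt)       # (i) edge maps vs. structural text
    l_local = info_nce(regions, chunks)         # (ii) edge regions vs. text chunks
    l_consistency = info_nce(edge, img)         # (iii) edge maps vs. color images
    return (base + w_global * l_global + w_local * l_local
            + w_consistency * l_consistency)
```

Under these assumptions, setting all three weights to zero recovers the standard alignment loss, which reflects the plug-and-play character described above: the structural terms are added on top of an existing objective rather than replacing it.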
