Open-vocabulary grounding requires accurate vision-language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or achieve token-level alignment only through explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning (MIL) formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, comprising a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from a Lagrangian-constrained free-energy minimization. Extensive experiments show that ExpAlign consistently improves open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories. Most notably, it achieves 36.2 AP$_r$ on the LVIS minival split, outperforming other state-of-the-art methods at a comparable model scale while remaining lightweight and inference-efficient.
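To make the pooling mechanism concrete, the following is a minimal sketch of attention-based soft MIL pooling over token-region similarities, in the spirit of the Expectation Alignment Head described above. The function names, the temperature parameter `tau`, and the choice of mean-pooling over tokens are illustrative assumptions, not the paper's exact design; the sketch only shows how a softmax over regions yields an expected (attention-weighted) similarity per token, so that no explicit region supervision is needed.

```python
import math

def softmax(xs, tau=1.0):
    # Numerically stable softmax with temperature tau (illustrative).
    m = max(x / tau for x in xs)
    exps = [math.exp(x / tau - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def expectation_alignment_score(sim, tau=0.05):
    """Soft MIL pooling over a token-region similarity matrix.

    sim[t][r] is the similarity between text token t and image
    region r. For each token, a softmax over regions acts as an
    implicit (soft) region selector; the attention-weighted sum is
    the token's expected similarity. Averaging over tokens gives a
    phrase-level alignment score -- all choices here are assumptions
    for illustration, not the exact ExpAlign formulation.
    """
    token_scores = []
    for row in sim:
        attn = softmax(row, tau)  # soft instance selection over regions
        token_scores.append(sum(a * s for a, s in zip(attn, row)))
    return sum(token_scores) / len(token_scores)
```

With a low temperature the soft pooling approaches per-token max-pooling (each token attends almost entirely to its best-matching region), while higher temperatures smooth the selection, which is what makes the expectation differentiable and trainable without region-level labels.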