Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning. Our code is available at https://github.com/gzxiong/CIRCLES.