Visual abductive reasoning aims to produce likely explanations for visual observations. We propose Region Conditioned Adaptation (RCA), a simple yet effective hybrid parameter-efficient fine-tuning method that equips a frozen CLIP model with the ability to infer explanations from local visual cues. We encode ``local hints'' and ``global contexts'' into the visual prompts of the CLIP model separately, at fine-grained and coarse-grained levels respectively. Adapters are commonly used to fine-tune CLIP models for downstream tasks; we design a new attention adapter that directly steers the focus of the attention map by adding trainable adaptations to the query and key projections of the frozen CLIP model. Finally, we train the new model with a modified contrastive loss that regresses the visual feature simultaneously toward the features of the literal description and of the plausible explanation. This loss enables CLIP to maintain both its perception and reasoning abilities. Experiments on the Sherlock visual abductive reasoning benchmark show that RCA significantly outperforms previous state-of-the-art methods, ranking \nth{1} on the leaderboards (e.g., Human Acc: RCA 31.74 \textit{vs} CPT-CLIP 29.58, higher is better). We also validate that RCA generalizes to local perception benchmarks such as RefCOCO. We open-source our project at \textit{\color{magenta}{\url{https://github.com/LUNAProject22/RPA}}}.
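
As an illustration only, one plausible way to write a contrastive loss that pulls the visual feature toward both text targets is sketched below. All symbols are assumptions introduced here and do not come from the text above: $v_i$ is the region-conditioned visual feature of the $i$-th sample, $c_i$ and $e_i$ are the text features of its literal description and its plausible explanation, $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, $\tau$ is a temperature, and $B$ is the batch size. The exact form used by RCA (term weights, symmetric text-to-image terms) may differ.
\begin{equation}
\mathcal{L}_{\mathrm{NCE}}(v, t) = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\left(\mathrm{sim}(v_i, t_i)/\tau\right)}{\sum_{j=1}^{B} \exp\left(\mathrm{sim}(v_i, t_j)/\tau\right)}
\end{equation}
\begin{equation}
\mathcal{L} = \mathcal{L}_{\mathrm{NCE}}(v, c) + \mathcal{L}_{\mathrm{NCE}}(v, e)
\end{equation}
Under this reading, the first term keeps the visual feature aligned with what is literally observed (perception), while the second aligns it with the inferred explanation (reasoning), so neither ability is traded away for the other.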