State-of-the-art vision-language models (VLMs) fail systematically at understanding negation, a failure often referred to as affirmative bias. The limitation is particularly severe in described object detection (DOD) tasks. To address it, we make two primary contributions: (1) a new dataset pipeline and (2) a lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline that generates high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias: the structural loss of negation cues during tokenization. NegToMe groups negation cues with attributes into coherent semantic phrases, preserving correct polarity at the input level and enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and "girl" as simply "girl", NegToMe binds them into a single token whose meaning is distinct from that of "girl" alone. The module is combined with a parameter-efficient, strategic LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks while lowering the false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval, and it generalizes to state-of-the-art VLMs. This work marks a crucial step toward reliable negation understanding in real-world detection applications.
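The binding of "not" and "girl" into a single token can be illustrated with a minimal sketch. This is not the NegToMe implementation: the cue list `NEGATION_CUES`, the function name `merge_negation_tokens`, the `cue_weight` parameter, and the weighted-average merge rule are all assumptions made here for illustration, and the real module may select spans and combine embeddings differently.

```python
from typing import List, Sequence, Tuple

# Hypothetical cue list; the abstract does not say which cues are handled.
NEGATION_CUES = {"not", "no", "without"}


def merge_negation_tokens(
    tokens: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    cue_weight: float = 0.5,
) -> Tuple[List[str], List[List[float]]]:
    """Merge each negation cue with the token that follows it.

    The merged embedding is a weighted average of the two inputs
    (an assumed merge rule), so the cue can no longer be dropped
    independently of the word it negates.
    """
    if len(tokens) != len(embeddings):
        raise ValueError("tokens and embeddings must have the same length")

    merged_tokens: List[str] = []
    merged_embeddings: List[List[float]] = []
    i = 0
    while i < len(tokens):
        is_cue = tokens[i].lower() in NEGATION_CUES
        if is_cue and i + 1 < len(tokens):
            cue, target = embeddings[i], embeddings[i + 1]
            merged = [
                cue_weight * c + (1.0 - cue_weight) * t
                for c, t in zip(cue, target)
            ]
            merged_tokens.append(tokens[i] + "_" + tokens[i + 1])
            merged_embeddings.append(merged)
            i += 2  # skip the token that was absorbed into the merge
        else:
            merged_tokens.append(tokens[i])
            merged_embeddings.append(list(embeddings[i]))
            i += 1
    return merged_tokens, merged_embeddings


# Toy 2-d embeddings: "not" and "girl" point in different directions.
example_tokens, example_embeddings = merge_negation_tokens(
    ["not", "girl"], [[1.0, 0.0], [0.0, 1.0]]
)
# example_tokens is ["not_girl"]; its embedding [0.5, 0.5] differs
# from the embedding of "girl" alone, [0.0, 1.0].
```

In this toy setting the sequence shrinks from two tokens to one, and the merged vector is no longer identical to the vector for "girl", which is the property the abstract describes at the input level.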