Grounded Conversation Generation (GCG) is an emerging vision-language task that requires models to generate natural language responses interleaved with the corresponding object segmentation masks. Recent models such as GLaMM and OMG-LLaVA achieve pixel-level grounding but incur significant computational cost because they process a large number of visual tokens. Existing token pruning methods, such as FastV and PyramidDrop (PDrop), fail to preserve the local visual features that accurate grounding depends on, leading to substantial performance drops on GCG tasks. To address this, we propose Adaptive Local-Aware Token Pruning (ALTP), a simple yet effective framework that accelerates GCG models by prioritizing local object information. ALTP introduces two key components: (1) Detail Density Capture (DDC), which uses superpixel segmentation to retain tokens in object-centric regions and thereby preserve fine-grained details, and (2) Dynamic Density Formation (DDF), which allocates tokens dynamically according to information density so that semantically rich areas retain more tokens. Extensive experiments on the GranDf dataset demonstrate that ALTP significantly outperforms FastV and PyramidDrop on both GLaMM and OMG-LLaVA. Notably, when applied to GLaMM at a 90% reduction in visual tokens, ALTP achieves a 4.9% improvement in AP50 and a 5.0% improvement in Recall compared to PyramidDrop. Similarly, on OMG-LLaVA at a 90% token reduction, ALTP improves AP by 2.1% and mIoU by 3.0% compared to PyramidDrop.
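To make the idea of region-aware, density-based token retention concrete, the following is a minimal illustrative sketch and not the ALTP implementation. It assumes that each visual token already carries a region label (for example, from a superpixel segmentation mapped onto the token grid) and a non-negative importance score; the function names `allocate_budget` and `prune_tokens`, the use of the mean score as a region's "density", and the proportional allocation rule are all assumptions introduced here for illustration.

```python
import numpy as np


def allocate_budget(density, sizes, total_keep):
    """Split total_keep across regions in proportion to density (hypothetical rule).

    Each region's budget is capped by the number of tokens it contains.
    """
    density = np.asarray(density, dtype=float)
    sizes = np.asarray(sizes, dtype=int)
    total_keep = min(int(total_keep), int(sizes.sum()))
    if density.sum() > 0:
        weights = density / density.sum()
    else:
        weights = np.full(len(density), 1.0 / len(density))
    budget = np.minimum(np.floor(weights * total_keep).astype(int), sizes)
    # Hand out leftover slots one at a time, densest regions first,
    # skipping regions that are already fully retained.
    order = np.argsort(-density, kind="stable")
    while budget.sum() < total_keep:
        for r in order:
            if budget.sum() == total_keep:
                break
            if budget[r] < sizes[r]:
                budget[r] += 1
    return budget


def prune_tokens(scores, region_ids, keep_ratio):
    """Return sorted indices of tokens to keep.

    scores:     assumed non-negative per-token importance scores.
    region_ids: assumed per-token region labels (e.g. superpixel ids).
    keep_ratio: fraction of tokens to retain (0.1 means a 90% reduction).
    """
    scores = np.asarray(scores, dtype=float)
    region_ids = np.asarray(region_ids)
    regions = np.unique(region_ids)
    total_keep = max(1, int(round(keep_ratio * scores.size)))
    sizes = np.array([(region_ids == r).sum() for r in regions])
    # "Density" here is simply the mean score of a region (an assumption).
    density = np.array([scores[region_ids == r].mean() for r in regions])
    budget = allocate_budget(density, sizes, total_keep)
    kept = []
    for r, k in zip(regions, budget):
        idx = np.flatnonzero(region_ids == r)
        top = idx[np.argsort(-scores[idx], kind="stable")[:k]]
        kept.extend(top.tolist())
    return np.array(sorted(kept), dtype=int)


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.7, 0.6]
    region_ids = [0, 0, 1, 1, 1, 1, 2, 2]
    print(prune_tokens(scores, region_ids, keep_ratio=0.5).tolist())  # [0, 1, 6, 7]
```

In this toy input, the low-scoring region 1 receives no budget, while the two high-density regions keep all of their tokens. A uniform top-k over all tokens would give the same result here, but in general the per-region budget lets a dense region keep several tokens even when their individual scores are not among the global maxima.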