Fine-tuning vision-language models to emit dense coordinate lists improves visual grounding but also changes how models serialize, repeat, and terminate structured outputs. We study this behavior as a generation and control surface. In Gemma 4 12B, high-capacity q/k/v/o LoRA raises class-aware [email protected] from 0.007 to 0.448 while inducing repeated-tail pressure (duplicate rate 0.080, max repeat 23). A q/v rank sweep keeps max repeat at 21-22 across ranks 4-64, showing capacity persistence. The target signal is separable: object-level repeat-stop removes exact repeated records (duplicate rate 0.000, max repeat 1) while preserving F1 (0.494 to 0.490) and stricter [email protected] (0.381 to 0.385). Structure-axis probes localize the effect to bbox-coordinate object lists; dense non-bbox and spatial/count JSON remain repeat-clean, including under high-capacity adapters. Qwen3-VL-8B reproduces a clean controlled endpoint ([email protected] 0.318, duplicate rate 0.000), and COCO 2017 reproduces acquisition plus duplicate pressure. Dense coordinate-list adaptation therefore creates a structure-bound, cross-family interference surface that can be measured and controlled.
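As an illustration of the repeat metrics and the object-level control reported above, the following is a minimal sketch and not the implementation used in this work. It assumes that each emitted object is a (label, bbox) record, that duplicate rate is the fraction of records that are exact copies of an earlier record, that max repeat is the count of the most frequent exact record, and that repeat-stop truncates the list at the first exactly repeated record. The names `duplicate_rate`, `max_repeat`, and `repeat_stop` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Assumed record shape: (class label, (x1, y1, x2, y2)). Hypothetical, for illustration only.
Record = Tuple[str, Tuple[int, int, int, int]]


def duplicate_rate(records: List[Record]) -> float:
    """Fraction of records that exactly repeat an earlier record (assumed definition)."""
    if not records:
        return 0.0
    return 1.0 - len(set(records)) / len(records)


def max_repeat(records: List[Record]) -> int:
    """Occurrence count of the most frequent exact record (assumed definition)."""
    return max(Counter(records).values(), default=0)


def repeat_stop(records: List[Record]) -> List[Record]:
    """Keep records up to, but not including, the first exact repeat (assumed behavior)."""
    seen = set()
    kept: List[Record] = []
    for record in records:
        if record in seen:
            break  # treat the first exact repeat as the start of a repeated tail
        seen.add(record)
        kept.append(record)
    return kept


# Toy output with a repeated tail: the last record is emitted three times.
raw = [
    ("cat", (10, 20, 110, 220)),
    ("dog", (50, 60, 150, 260)),
    ("dog", (50, 60, 150, 260)),
    ("dog", (50, 60, 150, 260)),
]
controlled = repeat_stop(raw)
```

Under these assumed definitions, any list passed through `repeat_stop` has duplicate rate 0.0 and max repeat at most 1, which matches the shape of the controlled endpoint reported above. Because truncation only drops exact copies and whatever follows them, distinct objects emitted before the first repeat are unaffected.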