Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection. Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. Our code is available at https://github.com/ggare-cmu/DetPO
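To make the idea of gradient-free prompt refinement concrete, the following is a minimal sketch of a generic black-box search loop, not the DetPO algorithm itself. All names here (`detect`, `propose`, `score_prompt`, `optimize_prompt`) are hypothetical: `detect` stands in for an API call to an MLLM that returns predicted boxes for a text prompt, and `propose` stands in for any procedure that generates candidate prompt rewrites. The score is a crude recall proxy at a fixed IoU threshold rather than a standard detection metric, and the confidence calibration mentioned above is omitted.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Example = Tuple[object, List[Box]]       # (image, ground-truth boxes)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def score_prompt(
    prompt: str,
    examples: Sequence[Example],
    detect: Callable[[object, str], List[Box]],
    iou_thresh: float = 0.5,
) -> float:
    """Fraction of ground-truth boxes matched by at least one prediction.

    This is a simple recall proxy chosen for brevity (an assumption of
    this sketch), not mAP.
    """
    matched = total = 0
    for image, gt_boxes in examples:
        preds = detect(image, prompt)  # black-box model call, no gradients
        for gt in gt_boxes:
            total += 1
            if any(iou(gt, p) >= iou_thresh for p in preds):
                matched += 1
    return matched / total if total else 0.0


def optimize_prompt(
    seed: str,
    examples: Sequence[Example],
    detect: Callable[[object, str], List[Box]],
    propose: Callable[[str], List[str]],
    rounds: int = 3,
) -> Tuple[str, float]:
    """Greedy hill climbing: keep a candidate only if it scores strictly higher."""
    best, best_score = seed, score_prompt(seed, examples, detect)
    for _ in range(rounds):
        for candidate in propose(best):
            s = score_prompt(candidate, examples, detect)
            if s > best_score:
                best, best_score = candidate, s
    return best, best_score
```

Because the loop only needs model outputs, it applies equally to API-only models and to open-weights models that are too costly to fine-tune. Its cost is one `detect` call per candidate prompt per few-shot image, so the number of rounds and candidates directly bounds the API budget.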