Open-set object detection (OSOD) localizes objects while identifying and rejecting unknown classes at inference time. While recent OSOD models perform well on benchmarks, their behavior under realistic user prompting remains underexplored. In interactive XR settings, user-generated prompts are often ambiguous, underspecified, or overly detailed. To study prompt-conditioned robustness, we evaluate two OSOD models, GroundingDINO and YOLO-E, on real-world XR images and simulate diverse user prompting behaviors using vision-language models. We consider four prompt types: standard, underdetailed, overdetailed, and pragmatically ambiguous. We further examine the impact of two enhancement strategies applied to these prompts. Results show that both models remain stable under standard and underdetailed prompts but degrade under ambiguous prompts, while overdetailed prompts primarily affect GroundingDINO. Prompt enhancement substantially improves robustness under ambiguity, yielding gains exceeding 55% in mIoU and 41% in average confidence. Based on these findings, we propose several prompting strategies and prompt enhancement methods for OSOD models in XR environments.