Retrieving user-specified objects from complex scenes remains challenging, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner and cannot revise their predictions from user feedback. To address this, we propose IntRec, an interactive object retrieval framework that iteratively refines its predictions through user feedback. At its core is an Intent State (IS) that maintains dual memory sets: positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework yields substantial gains in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5 AP, respectively. On the challenging LVIS-Ambiguous benchmark, a single round of corrective feedback improves performance by +7.9 AP over the one-shot baseline, with less than 30 ms of added latency per interaction.
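The contrastive alignment described above can be sketched in a few lines. This is a minimal illustration only, not the paper's exact formulation: the cosine metric, the max-aggregation over each memory set, and the `penalty` weight are all assumptions introduced for the example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(candidates, positives, negatives, penalty=0.5):
    """Score each candidate embedding against the Intent State's dual memory:
    reward similarity to positive anchors, penalize similarity to negative
    constraints, and return (indices sorted best-first, raw scores).
    `penalty` trades off how strongly rejected hypotheses suppress a candidate
    (a hypothetical knob, not a parameter from the paper)."""
    scores = []
    for c in candidates:
        pos = max((cosine(c, p) for p in positives), default=0.0)
        neg = max((cosine(c, n) for n in negatives), default=0.0)
        scores.append(pos - penalty * neg)
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return order, scores

# Toy example: two candidates, one confirmed cue, one rejected hypothesis.
cands = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
order, scores = rank_candidates(cands, positives=[np.array([1.0, 0.0])],
                                negatives=[np.array([0.0, 1.0])])
```

After each round of feedback, the confirmed or rejected candidate's embedding would be appended to the corresponding memory set and the remaining candidates re-ranked, which is what keeps the per-interaction cost low.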