Natural-language instance navigation becomes challenging when the initial user request does not uniquely specify the target instance. A practical agent should reduce the user's burden by proactively asking only for the information needed to distinguish the target from similar distractors, rather than requiring a detailed description upfront. Existing approaches often fall short of this goal: they may stop at the first plausible candidate before sufficiently exploring alternatives, or, even after collecting multiple candidates, ask about target attributes derived from individual candidates rather than asking questions selected to discriminate among the candidates in the pool. As a result, even after a dialogue, the agent may still fail to distinguish the target from distractors, leading to premature decisions and lengthy user responses. We propose Proactive Instance Navigation with Comparative Judgment (ProCompNav), a two-stage framework that first constructs a candidate pool and then identifies the target through comparative judgment. At each round, ProCompNav extracts an attribute-value pair that splits the current pool, asks a binary yes/no question about it, and prunes all inconsistent candidates at once. This reframes disambiguation from open-ended target description to pool-level discriminative questioning, where each question is chosen to narrow the candidate set. On CoIN-Bench, ProCompNav improves Success Rate over both interactive baselines given the same minimal input and non-interactive baselines given detailed descriptions, while substantially reducing Response Length. ProCompNav also achieves state-of-the-art Success Rate on TextNav, suggesting that comparative judgment is broadly useful for instance-level navigation among similar distractors.
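The second stage described above (pick an attribute-value pair that splits the pool, ask a yes/no question, prune every inconsistent candidate) can be illustrated with a minimal sketch. This is not the ProCompNav implementation: the dictionary representation of candidates, the function names `pick_splitting_pair` and `identify_target`, the `answer_yes_no` callback standing in for the user, and the preference for the most even split are all assumptions made for illustration. The abstract states only that the chosen pair splits the current pool.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical representation: each candidate maps attribute names to values.
Candidate = Dict[str, str]


def pick_splitting_pair(pool: List[Candidate]) -> Optional[Tuple[str, str]]:
    """Return an (attribute, value) pair that holds for some, but not all, candidates.

    Assumption: among splitting pairs, prefer the one closest to an even split.
    Returns None when no pair separates the remaining candidates.
    """
    counts = Counter((attr, val) for cand in pool for attr, val in cand.items())
    splitting = [(pair, n) for pair, n in counts.items() if n < len(pool)]
    if not splitting:
        return None
    half = len(pool) / 2
    # Ties are broken by the pair itself so the choice is deterministic.
    pair, _ = min(splitting, key=lambda item: (abs(item[1] - half), item[0]))
    return pair


def identify_target(
    pool: List[Candidate],
    answer_yes_no: Callable[[str, str], bool],
    max_rounds: int = 10,
) -> List[Candidate]:
    """Ask binary questions and prune the pool until one candidate remains.

    `answer_yes_no(attr, val)` stands in for the user and returns True if the
    target has `attr == val`.
    """
    remaining = list(pool)
    for _ in range(max_rounds):
        if len(remaining) <= 1:
            break
        pair = pick_splitting_pair(remaining)
        if pair is None:  # remaining candidates are indistinguishable
            break
        attr, val = pair
        said_yes = answer_yes_no(attr, val)
        # Keep only candidates consistent with the answer (pool-level pruning).
        remaining = [c for c in remaining if (c.get(attr) == val) == said_yes]
    return remaining


# Usage: four similar chairs; the simulated user answers from the true target.
chairs = [
    {"color": "red", "location": "kitchen", "size": "small"},
    {"color": "red", "location": "bedroom", "size": "large"},
    {"color": "blue", "location": "kitchen", "size": "large"},
    {"color": "blue", "location": "bedroom", "size": "small"},
]
target = chairs[2]
result = identify_target(chairs, lambda attr, val: target.get(attr) == val)
print(result)
```

Because every question is chosen to split the pool, each answer removes at least one candidate, and the simulated user only ever has to reply yes or no. A realistic system would also have to handle unobserved attributes and uncertain answers, which this sketch ignores.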