Affordance grounding requires identifying where and how an agent should interact in open-world scenes, where actionable regions are often small, occluded, reflective, and visually ambiguous. Recent systems therefore combine multiple skills (e.g., detection, segmentation, interaction-imagination), yet most orchestrate them with fixed pipelines that are poorly matched to per-instance difficulty, offer limited targeted recovery from intermediate errors, and fail to reuse experience from recurring objects. These failures expose a systems problem: test-time grounding must acquire the right evidence, decide whether that evidence is reliable enough to commit, and do so under bounded inference cost without access to labels. We propose Affordance Agent Harness, a closed-loop runtime that unifies heterogeneous skills with an evidence store and cost control, retrieves episodic memories to provide priors for recurring categories, and employs a Router to adaptively select and parameterize skills. An affordance-specific Verifier then gates commitments using self-consistency, cross-scale stability, and evidence sufficiency, triggering targeted retries before a final judge fuses accumulated evidence and trajectories into the prediction. Experiments on multiple affordance benchmarks and difficulty-controlled subsets show a stronger accuracy-cost Pareto frontier than fixed-pipeline baselines, improving grounding quality while reducing average skill calls and latency. Project page: https://tenplusgood.github.io/a-harness-page/.
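The closed loop described in the abstract (route a skill, store its evidence, verify, retry or commit, all under a cost budget) can be illustrated with a small sketch. This is not the authors' implementation: every name here (`Skill`, `Evidence`, `route`, `verify`, `judge`, `ground`), the box-agreement test standing in for the Verifier's self-consistency check, the cheapest-first routing rule, and the toy skill outputs are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical format

@dataclass
class Evidence:
    skill: str
    box: Box
    score: float

@dataclass
class Skill:
    name: str
    cost: float                      # abstract cost units, assumed
    run: Callable[[str], Evidence]   # query -> evidence

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def route(remaining: List[Skill], budget_left: float,
          prior: Optional[str]) -> Optional[Skill]:
    """Assumed policy: prefer the skill recalled from memory, else the cheapest."""
    affordable = [s for s in remaining if s.cost <= budget_left]
    if not affordable:
        return None
    for s in affordable:
        if s.name == prior:
            return s
    return min(affordable, key=lambda s: s.cost)

def verify(store: List[Evidence], min_items: int = 2, min_iou: float = 0.5) -> bool:
    """Assumed gate: enough evidence, and the best box is corroborated by another."""
    if len(store) < min_items:
        return False
    best = max(store, key=lambda e: e.score)
    return any(e is not best and iou(e.box, best.box) >= min_iou for e in store)

def judge(store: List[Evidence], min_iou: float = 0.5) -> Optional[Box]:
    """Fuse evidence: score-weighted mean of boxes that agree with the best one."""
    if not store:
        return None
    best = max(store, key=lambda e: e.score)
    agreeing = [e for e in store if iou(e.box, best.box) >= min_iou]
    total = sum(e.score for e in agreeing)
    if total <= 0:
        return best.box
    return tuple(sum(e.score * e.box[i] for e in agreeing) / total for i in range(4))

def ground(query: str, skills: List[Skill], budget: float,
           memory: Dict[str, str]) -> Tuple[Optional[Box], float, List[str]]:
    """Closed loop: route, run, store, verify; stop on commit or exhausted budget."""
    store: List[Evidence] = []
    trace: List[str] = []
    remaining = list(skills)
    spent = 0.0
    while remaining:
        skill = route(remaining, budget - spent, memory.get(query))
        if skill is None:
            break                    # nothing affordable is left
        remaining.remove(skill)
        store.append(skill.run(query))
        trace.append(skill.name)
        spent += skill.cost
        if verify(store):
            break                    # evidence judged reliable enough to commit
    return judge(store), spent, trace

# Toy skills with fixed outputs (made-up numbers, for illustration only).
skills = [
    Skill("imagine", 5.0, lambda q: Evidence("imagine", (0.0, 0.0, 100.0, 100.0), 0.3)),
    Skill("detect", 1.0, lambda q: Evidence("detect", (10.0, 10.0, 50.0, 50.0), 0.6)),
    Skill("segment", 2.0, lambda q: Evidence("segment", (12.0, 12.0, 50.0, 50.0), 0.8)),
]
box, spent, trace = ground("mug handle", skills, budget=10.0, memory={})
```

In this toy run the loop commits after the two cheap skills agree, so the expensive `imagine` skill is never called; this mirrors, in a very simplified form, the accuracy-cost trade-off the abstract describes.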