AFFORDANCE20Q: Evaluating Affordance Reasoning from Physical Properties

Affordance reasoning, the inference of an object's action possibilities from its physical properties (e.g., shape and material), is fundamental to human physical understanding and increasingly critical for Large Language Models (LLMs). However, existing affordance benchmarks largely expose explicit object identities in the evaluation setup, allowing models to rely on memorized object-affordance mappings rather than reasoning over physical properties. To address this gap, we introduce Affordance20Q, a novel affordance reasoning benchmark formulated as a 20-Questions game that never exposes the object's identity. In each game, the model identifies a hidden object's affordance from a candidate set by asking yes/no questions about its physical properties. Affordance20Q comprises 1,009 games over 454 objects and 59 affordances, all manually filtered, refined, and annotated. We conduct comprehensive experiments with 15 state-of-the-art LLMs and find a substantial gap (roughly 20 points) relative to human performance. A KL-based information-gain (IG) analysis further shows that models fail to ask discriminating questions as the game progresses. To close the gap, we develop KB-Anchored Rule Induction (KARI), an LLM-based pipeline that generates affordance rules grounded in evidence from knowledge bases (KBs). KARI improves open-source LLMs by up to 15.2 points, although the limited coverage of KBs hinders further gains. We release all our code and data at https://github.com/1171-jpg/Affordance20Q.git.
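To make the KL-based information-gain idea concrete, the following is a minimal sketch rather than the paper's actual analysis code. It assumes that the information gain of one yes/no answer can be measured as the KL divergence from the prior belief over candidate affordances to the posterior belief after the answer, and that an answer simply rules candidates in or out. The function names and the four example affordances are hypothetical and chosen only for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits for two distributions over the same candidates."""
    return sum(pi * math.log2(pi / q[k]) for k, pi in p.items() if pi > 0)


def update_belief(prior, consistent):
    """Zero out candidates that contradict the answer, then renormalize."""
    mass = sum(prob for cand, prob in prior.items() if cand in consistent)
    if mass == 0:
        raise ValueError("the answer rules out every candidate")
    return {
        cand: (prob / mass if cand in consistent else 0.0)
        for cand, prob in prior.items()
    }


def information_gain(prior, consistent):
    """Assumed IG of one answer: KL(posterior || prior)."""
    posterior = update_belief(prior, consistent)
    return kl_divergence(posterior, prior)


# Hypothetical candidate affordances with a uniform prior.
candidates = ["cut", "contain", "pour", "sit"]
prior = {c: 1.0 / len(candidates) for c in candidates}

# A question whose answer keeps half of the candidates.
ig_split = information_gain(prior, {"contain", "pour"})

# A question whose answer rules out nothing.
ig_none = information_gain(prior, set(candidates))
```

Under these assumptions, a question that halves a uniform candidate set yields one bit of information, while a question that eliminates no candidates yields zero, which is the sense in which a question can fail to be discriminating.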
