TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

ZhiYuan Feng, Yu Deng, Ruichuan An, Zhenhua Liu, Qixiu Li, Keming Wu, Zhiying Du, Weijie Wang, Haoxiao Wang, Shuang Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

Source: arXiv. Project page: https://aaronfengzy.github.io/TaskGround/

In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such requests require agents to identify task-relevant entities, recover intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, an agent must infer executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain substantial task-irrelevant information, making direct complete-scene prompting inefficient and error-prone. In practical deployment, this challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models with limited long-context reasoning ability. We propose TaskGround, a training-free and model-agnostic Ground-Infer-Execute framework that grounds complete scenes into compact task-relevant scene slices, infers executable task structure, and compiles it into grounded skill-level action sequences. To evaluate this setting, we introduce FullHome, a human-validated evaluation suite of 400 household tasks spanning diverse home-scale environments and both goal-oriented and process-constrained requirements. On FullHome, TaskGround improves task success rates by large margins across both proprietary and open-weight models. Notably, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while reducing total input-token cost by up to 18x. Our results identify executable task-structure inference as a central bottleneck in full-scene household reasoning and show that structured grounding can make compact local models substantially more effective for practical household deployment.
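The Ground-Infer-Execute flow described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `ground`, `infer_structure`, `compile_actions`, and `TaskStructure`, the scene dictionary layout, and the skill names (`navigate_to`, `pick`, `place`) are all hypothetical. Keyword matching stands in for grounding and a hand-written stub stands in for the model call that would infer task structure.

```python
import re
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class TaskStructure:
    """Hypothetical container for inferred task structure."""
    conditions: list  # (object, relation, target) triples
    ordering: list = field(default_factory=list)  # (i, j): condition i before j


def ground(scene, request):
    """Toy grounding: keep only scene objects named in the request.

    A real system would presumably use a model here; keyword matching
    is an assumption made purely for illustration.
    """
    words = set(re.findall(r"[a-z_]+", request.lower()))
    return {name: attrs for name, attrs in scene.items() if name in words}


def infer_structure(scene_slice, request):
    """Stub for the inference step (a model call in practice).

    Returns a hand-written structure for the example request below.
    """
    conditions = [("mug", "in", "sink"), ("mug", "on", "shelf")]
    # Keep only conditions whose entities survived grounding.
    conditions = [c for c in conditions if c[0] in scene_slice and c[2] in scene_slice]
    ordering = [(0, 1)] if len(conditions) == 2 else []
    return TaskStructure(conditions, ordering)


def compile_actions(task):
    """Compile conditions into a skill-level sequence that respects ordering."""
    sorter = TopologicalSorter({i: set() for i in range(len(task.conditions))})
    for before, after in task.ordering:
        sorter.add(after, before)  # 'after' depends on 'before'
    actions = []
    for i in sorter.static_order():
        obj, _relation, target = task.conditions[i]
        actions += [
            f"navigate_to({obj})",
            f"pick({obj})",
            f"navigate_to({target})",
            f"place({obj}, {target})",
        ]
    return actions


# Hypothetical full scene: most objects are irrelevant to the request.
scene = {
    "mug": {"room": "kitchen"},
    "sink": {"room": "kitchen"},
    "shelf": {"room": "kitchen"},
    "sofa": {"room": "living_room"},
    "lamp": {"room": "bedroom"},
}
request = "Wash the mug in the sink, then put the mug on the shelf."

scene_slice = ground(scene, request)
task = infer_structure(scene_slice, request)
actions = compile_actions(task)
```

In this sketch the scene slice passed to the inference step contains three of the five objects, which mirrors, at toy scale, the idea that grounding shrinks the input the model must reason over before any action sequence is produced.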
