Embodied agents, such as robots, will need to interact in situated environments where successful communication often depends on reasoning over social norms: shared expectations that constrain what actions are appropriate in context. A key capability in such settings is norm-based reference resolution (NBRR), where interpreting referential expressions requires inferring implicit normative expectations grounded in physical and social context. Yet it remains unclear whether Large Language Models (LLMs) can support this kind of reasoning. In this work, we introduce SNIC (Situated Norms in Context), a human-validated diagnostic testbed designed to probe how well state-of-the-art LLMs can extract and utilize normative principles relevant to NBRR. SNIC emphasizes physically grounded norms that arise in everyday tasks such as cleaning, tidying, and serving. Across a range of controlled evaluations, we find that even the strongest LLMs struggle to consistently identify and apply social norms, particularly when norms are implicit, underspecified, or in conflict. These findings reveal a blind spot in current LLMs and highlight a key challenge for deploying language-based systems in socially situated, embodied settings.