The deployment of Large Language Models (LLMs) in embodied agents creates an urgent need to measure their privacy awareness in the physical world. Existing evaluation methods, however, are confined to natural-language-based scenarios. To bridge this gap, we introduce EAPrivacy, a comprehensive evaluation benchmark designed to quantify the physical-world privacy awareness of LLM-powered agents. EAPrivacy uses procedurally generated scenarios across four tiers to test an agent's ability to handle sensitive objects, adapt to changing environments, balance task execution with privacy constraints, and resolve conflicts with social norms. Our measurements reveal a critical deficit in current models: the top-performing model, Gemini 2.5 Pro, achieved only 59\% accuracy in scenarios involving changing physical environments. Furthermore, when a task was accompanied by a privacy request, models prioritized task completion over the constraint in up to 86\% of cases. In high-stakes situations pitting privacy against critical social norms, leading models such as GPT-4o and Claude-3.5-haiku disregarded the social norm more than 15\% of the time. These findings underscore a fundamental misalignment in current LLMs regarding physically grounded privacy and establish the need for more robust, physically aware alignment. Code and datasets will be available at https://github.com/Graph-COM/EAPrivacy.