Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, which limits context-sensitive reasoning and effective assistance. Existing benchmarks similarly expose only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over these rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, which requires agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below results reported on prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating the utility of scalable data infrastructure.