AI agents are increasingly able to handle tasks of growing duration and complexity, demonstrating strong performance on coding, deep-research, and complex problem-solving evaluations. In everyday scenarios, however, general users' perception of these advanced capabilities remains limited. We argue that current evaluations prioritize increasing task difficulty without sufficiently covering the diversity of agentic tasks that span the daily work, life, and learning activities of a broad population. To address this, we propose AgentIF-OneDay, a benchmark that examines whether general users can complete a diverse array of daily tasks using natural language instructions and AI agents. These tasks require not only solving problems through dialogue but also understanding various attachment types and delivering tangible file-based results. The benchmark is structured around three user-centric categories: Open Workflow Execution, which assesses adherence to explicit and complex workflows; Latent Instruction, which requires agents to infer implicit instructions from attachments; and Iterative Refinement, which involves modifying or extending ongoing work. We employ instance-level rubrics and a refined evaluation pipeline that aligns LLM-based verification with human judgment, reaching an 80.1% agreement rate with Gemini-3-Pro as the judge. AgentIF-OneDay comprises 104 tasks covering 767 scoring points. Benchmarking four leading general-purpose AI agents, we find that API-based agent products and the agent-RL-based ChatGPT agent both occupy the first tier. Leading LLM APIs and open-source models have internalized agentic capabilities, enabling AI application teams to build cutting-edge agent products.
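The rubric-based evaluation described above can be sketched as follows. This is a minimal illustrative example, not the AgentIF-OneDay pipeline itself: it assumes each task decomposes into pass/fail scoring points, with an LLM judge's verdict compared against a human annotator's verdict point by point to compute the agreement rate. All names and data below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RubricPoint:
    """One instance-level scoring point with two verdicts."""
    task_id: str
    criterion: str
    llm_pass: bool    # verdict from the LLM-based verifier
    human_pass: bool  # verdict from a human annotator

def agreement_rate(points: list[RubricPoint]) -> float:
    """Fraction of scoring points where LLM and human verdicts match."""
    if not points:
        return 0.0
    matches = sum(p.llm_pass == p.human_pass for p in points)
    return matches / len(points)

# Illustrative scoring points for two tasks (not real benchmark data).
points = [
    RubricPoint("t1", "deliverable is a .docx file", True, True),
    RubricPoint("t1", "summary table has three columns", False, True),
    RubricPoint("t2", "cites the attached PDF", True, True),
    RubricPoint("t2", "preserves the original document's tone", False, False),
]
print(f"agreement: {agreement_rate(points):.1%}")  # → agreement: 75.0%
```

In the paper's setting, the same computation would run over all 767 scoring points across the 104 tasks, yielding the reported 80.1% figure when Gemini-3-Pro produces the LLM verdicts.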