AI agents are increasingly capable of handling tasks of growing duration and complexity, achieving strong performance on coding, deep-research, and complex problem-solving benchmarks. In everyday scenarios, however, general users' perception of these advanced capabilities remains limited. We argue that current evaluations prioritize raising task difficulty without sufficiently covering the diversity of agentic tasks that span the daily work, life, and learning activities of a broad population. To address this, we propose AgentIF-OneDay, a benchmark that tests whether general users can complete a diverse array of daily tasks using natural-language instructions and AI agents. These tasks require not only solving problems through dialogue but also understanding various attachment types and delivering tangible file-based results. The benchmark is organized into three user-centric categories: Open Workflow Execution, which assesses adherence to explicit, complex workflows; Latent Instruction, which requires agents to infer implicit instructions from attachments; and Iterative Refinement, which involves modifying or extending ongoing work. We employ instance-level rubrics and a refined evaluation pipeline that aligns LLM-based verification with human judgment, reaching an 80.1% agreement rate with Gemini-3-Pro as the judge. AgentIF-OneDay comprises 104 tasks covering 767 scoring points. We benchmarked four leading general-purpose AI agents and found that agent products built on top of LLM APIs and ChatGPT Agent, trained with agent RL, both occupy the first tier. Leading LLM APIs and open-source models have internalized agentic capabilities, enabling AI application teams to build cutting-edge agent products.