Large Language Model (LLM) agents are increasingly deployed for computer automation, yet their ability to operate complex, professional-grade productivity software remains largely untested. We argue that Office automation is a well-suited testbed for this capability because it demands long-horizon planning and reasoning, precise parameter configuration, and integration across multiple applications. To quantify it, we introduce an evaluation based on China's National Computer Rank Examination (NCRE), comprising 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric, with 7,118 machine-gradable criteria in total; Score Rate (SR) denotes the mean percentage of rubric points earned across the tasks. We benchmark seven frontier LLMs and observe stark limitations: the best single-turn model reaches only 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader access to Office automation reaches 68.8%, which is still well below the 95.5% community-reference score that we use as a sanity check on the scoring. Our experiments show that, despite recent advances in code generation, reliable fine-grained Office document automation remains a significant challenge for current code-generating LLMs and agent systems.
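As a reading aid, the Score Rate defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's actual scoring code: it assumes each task yields a single point total between 0 and 100, and the function name `score_rate` and the example point values are hypothetical.

```python
from statistics import mean

# Each task is graded on a 100-point rubric (stated in the abstract).
MAX_POINTS_PER_TASK = 100


def score_rate(task_points):
    """Mean percentage of rubric points earned across tasks.

    `task_points` holds one earned-point total per task; this input
    format is an assumption made for illustration only.
    """
    if not task_points:
        raise ValueError("need at least one task score")
    for points in task_points:
        if not 0 <= points <= MAX_POINTS_PER_TASK:
            raise ValueError(f"points out of range: {points}")
    # Convert each task to a percentage, then average over tasks.
    return mean(100.0 * p / MAX_POINTS_PER_TASK for p in task_points)


if __name__ == "__main__":
    # Made-up point totals for three tasks, not results from the paper.
    example_points = [40, 75, 10]
    print(f"SR = {score_rate(example_points):.1f}%")
```

Because every task shares the same 100-point maximum, averaging per-task percentages gives the same result as dividing total points earned by total points available.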