AcademiClaw: When Students Set Challenges for AI Agents

Junjie Yu, Pengrui Lu, Weiye Si, Hongliang Lu, Jiabao Wu, Kaiwen Tao, Kun Wang, Lingyu Yang, Qiran Zhang, Xiuting Guo, Xuanyu Wang, Yang Wang, Yanjie Wang, Yi Yang, Zijian Hu, Ziyi Yang, Zonghan Zhou, Binghao Qiang, Borui Zhang, Chenning Li, Enchang Zhang, Feifan Chen, Feng Jian, Fengyin Sun, Hao Qiu, Hao Zheng, Haoran Zhu, Hongyu Liu, Jianbin Deng, Jiaxin Song, Jiaying Chi, Jiayou Shi, Jie Fang, Jinghui Zhong, Jingyu Zhou, Jinze Li, Junfeng Yi, Junyan Yu, Junzhi Xue, Ni Song, Pengyi Chen, Qi Chen, Quansheng Li, Rui Tao, Shenghai Gong, Shenhang Lu, Tianqi Shen, Tianxiang Zhu, Tiehan Kang, Tingyu Li, Wendi Wu, Xiao Shen, Xiao Zhou, Xiaotao Zhang, Xinrong Li, Xuankun Yang, Xun Zhang, Yan Li, Ye Lu, Yi Wang, Yibo Zhou, Yichi Zhang, Yihao Sun, Yijun Huang, Yixin Zhu, Yixuan Wu, Yuchen Sun, Yue Wu, Yuheng Sun, Yukun Li, Yutian Tu, Yuxuan Qin, Yuzhuo Wu, Zeyu Li, Zhengyu Lou, Zhenning Ran, Zizhu He, Pengfei Liu

Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows (homework, research projects, competitions, and personal projects) that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.

