Benchmarks within the OpenClaw ecosystem have so far evaluated only assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks that university students drew directly from their real academic workflows (homework, research projects, competitions, and personal projects) and found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics that combine six complementary techniques, while an independent five-category safety audit provides additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.