With ChatGPT-like large language models (LLMs) prevailing in the community, how to evaluate the abilities of LLMs remains an open question. Existing evaluation methods suffer from the following shortcomings: (1) constrained evaluation abilities, (2) vulnerable benchmarks, and (3) unobjective metrics. We suggest that task-based evaluation, in which LLM agents complete tasks in a simulated environment, is a one-for-all solution to these problems. We present AgentSims, an easy-to-use infrastructure that lets researchers from all disciplines test the specific capacities they are interested in. Researchers can build their evaluation tasks by adding agents and buildings in an interactive GUI, or deploy and test new support mechanisms, i.e., memory, planning, and tool-use systems, with a few lines of code. Our demo is available at https://agentsims.com.