Large language models (LLMs) have received increasing attention. However, because their capabilities are complex, how to evaluate LLMs soundly remains an open problem. We propose RoCar, a method that uses a set of predefined basic schemas to randomly construct a task graph and then generates natural-language evaluation tasks from that graph, assessing the reasoning and memory abilities of LLMs respectively. Because the task-construction process is highly random, it can ensure that none of the LLMs under test has directly learned the evaluation tasks, guaranteeing the fairness of the evaluation method.
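The pipeline described above — sampling relations from basic schemas to build a random task graph, then rendering it as a natural-language task — can be sketched as follows. This is a minimal illustration, not the actual RoCar implementation: the schema set (`SCHEMAS`), the entity names, and the helper functions are all hypothetical placeholders.

```python
import random

# Hypothetical basic schemas: relation types with natural-language templates.
# The real RoCar schemas are not specified in this abstract.
SCHEMAS = {
    "friend": "{a} is a friend of {b}.",
    "parent": "{a} is a parent of {b}.",
    "colleague": "{a} is a colleague of {b}.",
}

def build_task_graph(entities, n_edges, seed=None):
    """Randomly connect entities with schema relations to form a task graph."""
    rng = random.Random(seed)
    edges = []
    for _ in range(n_edges):
        a, b = rng.sample(entities, 2)          # pick two distinct entities
        rel = rng.choice(list(SCHEMAS))         # pick a random relation schema
        edges.append((a, rel, b))
    return edges

def graph_to_task(edges):
    """Render the task graph as a natural-language evaluation prompt."""
    return " ".join(SCHEMAS[rel].format(a=a, b=b) for a, rel, b in edges)

entities = ["Alice", "Bob", "Carol", "Dave"]
graph = build_task_graph(entities, n_edges=3, seed=42)
prompt = graph_to_task(graph)
```

Because the graph is sampled fresh each time (different entities, relations, and edges per seed), the resulting prompt is extremely unlikely to appear in any model's training data, which is the source of the fairness guarantee the abstract claims.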