Large language models (LLMs) have achieved remarkable performance on a wide range of NLP tasks and are augmented with tools for broader applications. Yet how to evaluate and analyze the tool-utilization capability of LLMs remains under-explored. In contrast to previous works that evaluate models holistically, we decompose tool utilization into multiple sub-processes: instruction following, planning, reasoning, retrieval, understanding, and review. Based on this decomposition, we introduce T-Eval to evaluate tool-utilization capability step by step. T-Eval disentangles the evaluation into several sub-domains along model capabilities, facilitating an understanding of both the holistic and the isolated competencies of LLMs. We conduct extensive experiments on T-Eval and an in-depth analysis of various LLMs. T-Eval not only exhibits consistency with outcome-oriented evaluation but also provides a more fine-grained analysis of LLM capabilities, offering a new perspective on evaluating the tool-utilization ability of LLMs. The benchmark will be available at https://github.com/open-compass/T-Eval.