Recent evaluations of Large Language Models (LLMs) have centered on their zero-shot and few-shot capabilities for basic natural language tasks and on their ability to translate instructions into tool APIs. However, how well LLMs use complex tools to complete multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark, which assesses LLMs' ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System, which judges whether an LLM has completed an instruction from the prediction file rather than from the label API sequence, and therefore supports various LLM-generated API sequences. We evaluate 3 closed-source LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms the other LLMs with 75.1\% accuracy in single-turn dialogue testing but struggles to complete entire sessions, achieving just 6\% session accuracy. We identify three main causes of error in our benchmark: error accumulation across a multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at \url{https://github.com/gydpku/PPTC}.
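To make the two ideas in the abstract concrete, the following is a minimal toy sketch and not the PPTX-Match implementation. It illustrates (a) judging a prediction by the resulting file state rather than by the API sequence, and (b) why session accuracy can be far lower than per-turn accuracy. The call names `set_title` and `set_font_color`, the dictionary-based slide state, and the rule that a session counts as correct only when every turn is correct are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy slide state: attribute name -> value.
Slide = Dict[str, str]
# A toy API call is a (name, argument) pair; the names are illustrative only.
Call = Tuple[str, str]


def apply_calls(calls: List[Call]) -> Slide:
    """Execute a toy API sequence and return the resulting slide state."""
    slide: Slide = {}
    for name, arg in calls:
        if name == "set_title":
            slide["title"] = arg
        elif name == "set_font_color":
            slide["font_color"] = arg
        else:
            raise ValueError(f"unknown call: {name}")
    return slide


def state_match(predicted: List[Call], label: List[Call]) -> bool:
    """Compare the final states produced by two sequences, not the sequences."""
    return apply_calls(predicted) == apply_calls(label)


def turn_accuracy(sessions: List[List[bool]]) -> float:
    """Fraction of individual turns that were completed correctly."""
    turns = [ok for session in sessions for ok in session]
    return sum(turns) / len(turns)


def session_accuracy(sessions: List[List[bool]]) -> float:
    """Fraction of sessions in which every turn was correct (assumed rule)."""
    return sum(all(session) for session in sessions) / len(sessions)


# Two different sequences that produce the same final slide.
label = [("set_title", "Q3 Report"), ("set_font_color", "red")]
predicted = [("set_font_color", "red"), ("set_title", "Q3 Report")]
sequences_equal = predicted == label          # exact sequence comparison
states_equal = state_match(predicted, label)  # state-based comparison

# Per-turn outcomes for two toy sessions of three turns each.
outcomes = [[True, True, True], [True, False, True]]
turn_acc = turn_accuracy(outcomes)
session_acc = session_accuracy(outcomes)
```

In this toy setup the reordered sequence fails an exact sequence comparison but passes the state-based one, which is the property the abstract attributes to evaluating on the prediction file. One failed turn out of six is enough to halve the session accuracy, which gives a rough intuition for the gap between the two reported scores.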