Recent evaluations of Large Language Models (LLMs) have centered on their zero-shot and few-shot capabilities for basic natural language tasks and on their ability to translate instructions into tool APIs. However, how well LLMs use complex tools to complete multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark, which assesses LLMs' ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System, which judges whether an LLM has completed an instruction by examining the prediction file rather than comparing against the label API sequence, so it supports the varied API sequences that LLMs generate. We evaluate 3 closed-source LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms the other LLMs with 75.1\% accuracy in single-turn dialogue testing but struggles to complete entire sessions, achieving just 6\% session accuracy. We identify three main causes of error in our benchmark: error accumulation over the multi-turn session, long PPT template processing, and multi-modality perception. These pose significant challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at \url{https://github.com/gydpku/PPTC}.
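The gap between single-turn accuracy and session accuracy can be illustrated with a small sketch. It assumes that turn accuracy is the fraction of individual turns completed correctly and that session accuracy is the fraction of sessions in which every turn is completed correctly; these definitions, the function names, and the toy data are illustrative assumptions and are not taken from the PPTC code release.

```python
from typing import List


def turn_accuracy(sessions: List[List[bool]]) -> float:
    """Fraction of all turns, pooled across sessions, that were completed."""
    total_turns = sum(len(session) for session in sessions)
    if total_turns == 0:
        return 0.0
    correct_turns = sum(sum(session) for session in sessions)
    return correct_turns / total_turns


def session_accuracy(sessions: List[List[bool]]) -> float:
    """Fraction of sessions in which every turn was completed.

    Note: a session with no turns counts as completed, because all([]) is True.
    """
    if not sessions:
        return 0.0
    completed = sum(all(session) for session in sessions)
    return completed / len(sessions)


# Hypothetical per-turn outcomes (True = instruction completed); not real PPTC data.
toy_sessions = [
    [True, True, True],
    [True, False, True],
    [True, True, False, True],
]

print(f"turn accuracy:    {turn_accuracy(toy_sessions):.3f}")
print(f"session accuracy: {session_accuracy(toy_sessions):.3f}")
```

Under these assumed definitions a single failed turn fails the whole session, so session accuracy can never exceed turn accuracy and falls further behind as sessions get longer.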