The growing dependence on Large Language Models (LLMs) for completing user instructions necessitates a comprehensive understanding of their robustness in complex task completion under real-world conditions. To address this critical need, we propose the PowerPoint Task Completion Robustness benchmark (PPTC-R) to measure LLMs' robustness to user PPT task instructions and software versions. Specifically, we construct adversarial user instructions by attacking them at the sentence, semantic, and multi-language levels. To assess LLMs' robustness to software versions, we vary the number of provided APIs to simulate both newest-version and earlier-version settings. We then test 3 closed-source and 4 open-source LLMs on a benchmark that incorporates these robustness settings, aiming to evaluate how such deviations affect LLMs' API calls for task completion. We find that GPT-4 exhibits the highest performance and strong robustness on our benchmark, particularly in the version-update and multilingual settings. However, all LLMs lose their robustness when confronted with multiple challenges (e.g., multi-turn interaction) simultaneously, leading to significant performance drops. We further analyze the robustness behaviors and error causes of LLMs on our benchmark, which provide valuable insights for researchers to understand LLM robustness in task completion and to develop more robust LLMs and agents. We release the code and data at \url{https://github.com/ZekaiGalaxy/PPTCR}.