The proliferation of Large Language Models (LLMs) such as ChatGPT has significantly advanced language understanding and generation, with impact across a broad spectrum of applications. However, these models excel predominantly at text-based tasks and overlook the complexity of real-world multimodal information. This study introduces MultiAPI, a pioneering, comprehensive, large-scale API benchmark dataset aimed at extending LLMs' proficiency to multimodal contexts. Developed in collaboration with ChatGPT, MultiAPI consists of 235 diverse API calls and 2,038 contextual prompts, offering a unique platform for evaluating tool-augmented LLMs on multimodal tasks. Comprehensive experiments reveal that while LLMs demonstrate proficiency in API call decision-making, they struggle with domain identification, function selection, and argument generation. Surprisingly, we also find that auxiliary context can impair performance. An in-depth error analysis paves the way for a new paradigm to address these challenges, suggesting a potential direction for future LLM research.