The impressive performance of Large Language Models (LLMs) has consistently surpassed numerous human-designed benchmarks, presenting new challenges in assessing the shortcomings of LLMs. Designing tasks that expose LLMs' limitations is therefore becoming increasingly important. In this paper, we investigate whether an LLM can discover its own limitations from the errors it makes. To this end, we propose a Self-Challenge evaluation framework with a human in the loop. Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances, and we iteratively incorporate human feedback to refine these patterns into ones that yield more challenging data. We end up with 8 diverse patterns, such as text manipulation and questions with assumptions. We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses. SC-G4 serves as a challenging benchmark that allows for a detailed assessment of LLMs' abilities. Our results show that only 44.96\% of instances in SC-G4 can be answered correctly by GPT-4. Interestingly, our pilot study indicates that these error patterns also challenge other LLMs, such as Claude-3 and Llama-3, and cannot be fully resolved through fine-tuning. Our work takes a first step toward demonstrating that LLMs can autonomously identify their inherent flaws, and it provides insights for future dynamic and automatic evaluation.
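The iterative loop described above — summarize error patterns from seed failures, generate new instances, refine via human feedback — can be sketched as follows. This is a minimal illustration, not the authors' implementation: `llm` and `human_review` are hypothetical callables standing in for a model API and the human-in-the-loop step.

```python
def self_challenge(llm, human_review, seed_failures, max_rounds=2):
    """Sketch of the Self-Challenge loop: distill error patterns from seed
    failures the model got wrong, generate new instances from the patterns,
    and refine the patterns with human feedback over several rounds."""
    # Step 1: ask the model to summarize error patterns behind its failures.
    patterns = llm("Summarize error patterns:\n" + "\n".join(seed_failures))
    instances = []
    for _ in range(max_rounds):
        # Step 2: generate new candidate instances from the current patterns.
        instances = llm("Generate instances from patterns:\n" + patterns).splitlines()
        # Step 3: human-in-the-loop feedback on patterns and instances.
        feedback = human_review(patterns, instances)
        # Step 4: refine the patterns using that feedback, then iterate.
        patterns = llm("Refine patterns with feedback:\n" + feedback)
    return patterns, instances

# Toy stand-ins so the sketch runs without any model API:
toy_llm = lambda prompt: "pattern: text manipulation"
toy_review = lambda patterns, instances: "tighten pattern wording"
pats, insts = self_challenge(toy_llm, toy_review,
                             ["GPT-4 failed to reverse a given string"])
```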