As LLMs continue to evolve and improve, they have acquired impressive logical reasoning, or vertical thinking, capabilities. But can they think outside the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses a model's lateral thinking within an interactive framework. Our benchmark challenges LLMs on two aspects: the quality of the questions the model poses and the model's ability to integrate information for problem-solving. We find that nearly all LLMs struggle to employ lateral thinking during interactions. For example, even the most advanced model, GPT-4, shows some advantage yet still maintains a noticeable gap compared with humans. This evaluation benchmark provides LLMs with a highly challenging and distinctive task that is crucial to an effective AI assistant.