With their continuous evolution and refinement, LLMs have acquired impressive logical reasoning, or vertical thinking, capabilities. But can they think outside the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses a model's lateral thinking within an interactive framework. In our benchmark, we challenge LLMs on two aspects: the quality of the questions the model poses and the model's capability to integrate information for problem-solving. We find that nearly all LLMs struggle to employ lateral thinking during interactions. For example, even the most advanced model, GPT-4, holds some advantage, yet still shows a noticeable gap when compared to humans. This evaluation benchmark provides LLMs with a highly challenging and distinctive task, one that is crucial to an effective AI assistant.