As LLMs continue to evolve and improve, they have acquired impressive logical reasoning, or vertical thinking, capabilities. But can they think outside the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses a model's lateral thinking within an interactive framework. In our benchmark, we challenge LLMs on two aspects: the quality of the questions the model poses and the model's capability to integrate information for problem-solving. We find that nearly all LLMs struggle to employ lateral thinking during interactions. For example, even the most advanced model, GPT-4, shows some advantage yet still maintains a noticeable gap compared to humans. This evaluation benchmark provides LLMs with a highly challenging and distinctive task that is crucial to an effective AI assistant.