Cryptic crosswords are puzzles that rely on general knowledge and on the solver's ability to manipulate language at different levels, using various types of wordplay. Previous research suggests that solving such puzzles is challenging even for modern NLP models, including Large Language Models (LLMs). However, there is little to no research on the reasons for their poor performance on this task. In this paper, we establish benchmark results for three popular LLMs: Gemma2, LLaMA3 and ChatGPT, showing that their performance on this task is still significantly below that of humans. We also investigate why these models struggle to achieve strong performance. We release our code and the datasets we introduce at https://github.com/bodasadallah/decrypting-crosswords.