The Two Word Test: A Semantic Benchmark for Large Language Models

Large Language Models (LLMs) have recently shown remarkable abilities, including passing advanced professional exams and demanding benchmark tests. This performance has led many to suggest that they are close to achieving humanlike or 'true' understanding of language, and even Artificial General Intelligence (AGI). Here, we provide a new open-source benchmark, the Two Word Test (TWT), that assesses the semantic abilities of LLMs using two-word phrases, in a task that humans can perform relatively easily without advanced training. Combining multiple words into a single concept is a fundamental aspect of human language and intelligence. The test requires meaningfulness judgments of 1768 noun-noun combinations that have been rated as meaningful (e.g., baby boy) or not meaningful (e.g., goat sky) by 150 human raters. We provide versions of the task that probe meaningfulness ratings on a 0-4 scale as well as binary judgments. We conducted a series of experiments using both versions of the TWT on GPT-4, GPT-3.5, and Bard. Results demonstrated that, compared to humans, all models perform poorly at rating the meaningfulness of these phrases. GPT-3.5 and Bard are also unable to discriminate between sensible and nonsense phrases in the binary version. GPT-4 shows a substantial improvement in binary discrimination of combinatorial phrases but still performs significantly worse than humans. The TWT can be used to understand the limitations and weaknesses of current LLMs, and potentially to improve them. The test also reminds us that caution is warranted in attributing 'true understanding' or AGI to LLMs. TWT is available at: https://github.com/NickRiccardi/two-word-test
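
To make the binary version of the task concrete, the following is a minimal sketch of how a model's meaningful/not-meaningful judgments could be compared with the human labels. It is not the scoring procedure used in this work: the function name `binary_accuracy`, the dictionary-based data format, and the model judgments shown are all assumptions made for illustration. Only the two phrases and their human labels come from the examples above.

```python
from typing import Dict


def binary_accuracy(human: Dict[str, bool], model: Dict[str, bool]) -> float:
    """Return the fraction of shared phrases where the model matches the human label.

    Hypothetical helper: True means "meaningful", False means "not meaningful".
    """
    shared = [phrase for phrase in human if phrase in model]
    if not shared:
        raise ValueError("no phrases in common between human and model labels")
    agreements = sum(human[phrase] == model[phrase] for phrase in shared)
    return agreements / len(shared)


# The two example phrases from the abstract, with their human labels.
human_labels = {"baby boy": True, "goat sky": False}

# Made-up model judgments, for illustration only (not a reported result).
model_labels = {"baby boy": True, "goat sky": True}

accuracy = binary_accuracy(human_labels, model_labels)
print(accuracy)
```

In this toy case the hypothetical model accepts both phrases as meaningful, so it agrees with the human label on only one of the two.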
