The Connections puzzle, published daily by the New York Times, asks players to divide a bank of sixteen words into four groups of four, with the words in each group sharing a common theme. Solving the puzzle requires common linguistic knowledge (i.e., definitions and typical usage) and, in many cases, lateral or abstract thinking, because the four categories ascend in complexity and the most challenging category often requires thinking about words in uncommon ways or as parts of larger phrases. We investigate the capacity of automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning and as a way to measure the semantic information encoded by data-driven linguistic systems. In particular, we study both a sentence-embedding baseline and modern large language models (LLMs). We report their accuracy on the task, measure the impact of chain-of-thought prompting, and discuss their failure modes. Overall, we find that the Connections task is challenging yet feasible, and a strong test-bed for future work.
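To make the task format concrete, the following minimal Python sketch checks a proposed solution against a reference grouping. It is illustrative only: the function names `is_valid_partition` and `count_correct_groups`, the sixteen example words, and the exact-match scoring rule are all assumptions made for this sketch, not the evaluation code or the accuracy metric used in the study.

```python
from typing import List, Sequence


def is_valid_partition(bank: Sequence[str],
                       proposed: Sequence[Sequence[str]],
                       group_size: int = 4) -> bool:
    """Check that the proposed groups use every bank word exactly once
    and that each group has the expected size."""
    flat = [word for group in proposed for word in group]
    return (all(len(group) == group_size for group in proposed)
            and sorted(flat) == sorted(bank))


def count_correct_groups(gold: Sequence[Sequence[str]],
                         proposed: Sequence[Sequence[str]]) -> int:
    """Count proposed groups that exactly match a gold group
    (order within a group is ignored). Assumed scoring rule."""
    gold_sets = {frozenset(group) for group in gold}
    return sum(frozenset(group) in gold_sets for group in proposed)


# Hypothetical puzzle invented for this example (not a real NYT puzzle).
gold: List[List[str]] = [
    ["APPLE", "PEAR", "PLUM", "FIG"],       # fruits
    ["RED", "BLUE", "GREEN", "PINK"],       # colors
    ["MARS", "VENUS", "SATURN", "EARTH"],   # planets
    ["CHESS", "SURF", "SKATE", "KEY"],      # words that precede "board"
]
bank = [word for group in gold for word in group]

# A guess that confuses one fruit with one "board" word.
proposed = [
    ["APPLE", "PEAR", "PLUM", "KEY"],
    ["RED", "BLUE", "GREEN", "PINK"],
    ["MARS", "VENUS", "SATURN", "EARTH"],
    ["CHESS", "SURF", "SKATE", "FIG"],
]

valid = is_valid_partition(bank, proposed)
n_correct = count_correct_groups(gold, proposed)
print(valid, n_correct)
```

In this sketch a single misplaced word costs two groups, since both the group it left and the group it joined no longer match. That is one reason an all-or-nothing group score can be harsh, and why other scoring rules (for example, per-word agreement) are plausible alternatives.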