The success of language models has inspired the NLP community to attend to tasks that require implicit and complex reasoning, relying on human-like commonsense mechanisms. While such vertical thinking tasks have been relatively popular, lateral thinking puzzles have received little attention. To bridge this gap, we devise BRAINTEASER: a multiple-choice Question Answering task designed to test a model's ability to exhibit lateral thinking and defy default commonsense associations. We design a three-step procedure for creating the first lateral thinking benchmark, consisting of data collection, distractor generation, and generation of adversarial examples, leading to 1,100 puzzles with high-quality annotations. To assess the consistency of lateral reasoning by models, we enrich BRAINTEASER based on a semantic and contextual reconstruction of its questions. Our experiments with state-of-the-art instruction-based and commonsense language models reveal a significant gap between human and model performance, which widens further when consistency across adversarial formats is considered. We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.
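To make the consistency idea concrete, the following is a minimal sketch of one way to score a model both per question and per puzzle group, where a group collects an original puzzle with its semantic and contextual reconstructions. The record fields (`puzzle_id`, `variant`, `pred`, `gold`), the function names, and the toy data are all assumptions for illustration; the abstract does not specify the exact metric or data format used in the benchmark.

```python
from collections import defaultdict


def instance_accuracy(records):
    """Fraction of individual questions answered correctly."""
    if not records:
        raise ValueError("records must not be empty")
    return sum(r["pred"] == r["gold"] for r in records) / len(records)


def group_accuracy(records):
    """Fraction of puzzles for which every variant is answered correctly.

    A puzzle counts as solved only if the model is right on the original
    question and on all of its reconstructions (hypothetical grouping key:
    'puzzle_id').
    """
    if not records:
        raise ValueError("records must not be empty")
    groups = defaultdict(list)
    for r in records:
        groups[r["puzzle_id"]].append(r["pred"] == r["gold"])
    return sum(all(hits) for hits in groups.values()) / len(groups)


# Toy predictions (made up): answer indices for two puzzles, three variants each.
records = [
    {"puzzle_id": "p1", "variant": "original", "pred": 0, "gold": 0},
    {"puzzle_id": "p1", "variant": "semantic", "pred": 0, "gold": 0},
    {"puzzle_id": "p1", "variant": "context", "pred": 2, "gold": 1},
    {"puzzle_id": "p2", "variant": "original", "pred": 3, "gold": 3},
    {"puzzle_id": "p2", "variant": "semantic", "pred": 3, "gold": 3},
    {"puzzle_id": "p2", "variant": "context", "pred": 1, "gold": 1},
]

if __name__ == "__main__":
    print(f"instance accuracy: {instance_accuracy(records):.3f}")
    print(f"group accuracy:    {group_accuracy(records):.3f}")
```

In this toy data the model answers five of six questions correctly but solves only one of the two puzzles consistently, which illustrates why a group-level score can be noticeably lower than a per-question score.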