The success of language models has inspired the NLP community to attend to tasks that require implicit and complex reasoning, relying on human-like commonsense mechanisms. While such vertical thinking tasks have been relatively popular, lateral thinking puzzles have received little attention. To bridge this gap, we devise BRAINTEASER: a multiple-choice Question Answering task designed to test a model's ability to exhibit lateral thinking and defy default commonsense associations. We design a three-step procedure for creating the first lateral thinking benchmark, consisting of data collection, distractor generation, and generation of adversarial examples, yielding 1,100 puzzles with high-quality annotations. To assess the consistency of lateral reasoning by models, we enrich BRAINTEASER with semantic and contextual reconstructions of its questions. Our experiments with state-of-the-art instruction-based and commonsense language models reveal a significant gap between human and model performance, which widens further when consistency across adversarial formats is considered. We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.
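To illustrate why accounting for consistency across adversarial formats can widen the human-model gap, the following is a minimal sketch of two accuracy measures. It assumes each puzzle has an original question plus semantic and contextual reconstructions, and that a puzzle counts as consistently solved only when every variant is answered correctly. The record fields (`puzzle_id`, `variant`, `gold`, `pred`), the function names, and the toy data are hypothetical and are not taken from the released code; the exact metric definitions used by the benchmark may differ.

```python
from collections import defaultdict


def instance_accuracy(records):
    """Fraction of individual questions answered correctly."""
    return sum(r["pred"] == r["gold"] for r in records) / len(records)


def group_accuracy(records):
    """Fraction of puzzles for which every variant is answered correctly.

    Assumption: a puzzle is 'consistently solved' only if the original
    question and all of its reconstructions are answered correctly.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r["puzzle_id"]].append(r["pred"] == r["gold"])
    return sum(all(hits) for hits in groups.values()) / len(groups)


# Hypothetical toy predictions: answers are option indices.
records = [
    {"puzzle_id": "p1", "variant": "original", "gold": 0, "pred": 0},
    {"puzzle_id": "p1", "variant": "semantic", "gold": 2, "pred": 2},
    {"puzzle_id": "p1", "variant": "context", "gold": 1, "pred": 3},
    {"puzzle_id": "p2", "variant": "original", "gold": 1, "pred": 1},
    {"puzzle_id": "p2", "variant": "semantic", "gold": 1, "pred": 1},
    {"puzzle_id": "p2", "variant": "context", "gold": 3, "pred": 3},
]

print(f"instance accuracy: {instance_accuracy(records):.3f}")
print(f"group accuracy:    {group_accuracy(records):.3f}")
```

In this toy data the model answers five of six questions correctly but solves only one of two puzzles consistently, so the group-level score is lower than the instance-level score. A single miss on any variant removes the whole puzzle from the consistent count, which is why a stricter, consistency-aware measure tends to penalize models that rely on surface cues.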