Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models

This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and presents the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems structured at multiple levels: within each sentence, across the input sequence, and within each candidate answer. Because of their rich structure, these curated but naturalistic datasets are key to answering some core questions about the abilities of current large language models (LLMs): Do LLMs detect linguistic objects and their properties? Do they detect and use systematic patterns across sentences? Are they more prone to linguistic or reasoning errors, and how do these interact? We show that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models. We show that the models' representations contain the grammatical objects and attributes relevant to solving a linguistic task. We also show that these solutions are reached by detecting systematic patterns across sentences. The paper supports the view that curated, structured datasets enable multi-faceted investigations of the properties of language and of large language models. Because they have a curated, articulated structure, because they comprise both learning contexts and expected answers, and because they are partly built by hand, BLMs belong to the category of datasets that can support explainability investigations and help us ask why large language models behave the way they do.
