Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies

Large language models (LLMs) have shown remarkable performance across various sentence-based linguistic phenomena, yet their ability to capture cross-sentence paradigmatic patterns, such as verb alternations, remains underexplored. In this work, we present curated paradigm-based datasets for four languages, designed to probe systematic cross-sentence knowledge of verb alternations (change-of-state and object-drop constructions in English, German and Italian, and Hebrew binyanim). The datasets comprise thousands of Blackbird Language Matrices (BLM) problems. The BLM task, an RPM/ARC-like task devised specifically for language, is a controlled linguistic puzzle in which models must select the sentence that completes a pattern according to syntactic and semantic rules. We introduce three types of templates of varying complexity and apply linguistically informed data augmentation strategies across synthetic and natural data. We provide simple baseline results for English, Italian, German, and Hebrew that demonstrate the diagnostic usefulness of the datasets.
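
To make the task format concrete, here is a minimal Python sketch of how one BLM-style problem and its scoring might be represented. It is an illustration under stated assumptions, not the datasets' actual schema: the class `BLMProblem`, its field names, the helper functions, and the toy English sentences are all hypothetical and are not taken from the released data.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class BLMProblem:
    """Hypothetical container for one puzzle (field names are assumptions)."""
    context: Tuple[str, ...]     # sentences that establish the pattern
    candidates: Tuple[str, ...]  # answer set: one correct sentence plus distractors
    answer_index: int            # position of the correct sentence in `candidates`


def accuracy(problems: Sequence[BLMProblem],
             choose: Callable[[BLMProblem], int]) -> float:
    """Fraction of problems for which `choose` returns the correct index."""
    if not problems:
        return 0.0
    correct = sum(1 for p in problems if choose(p) == p.answer_index)
    return correct / len(problems)


def first_candidate_baseline(problem: BLMProblem) -> int:
    """Trivial stand-in for a model: always pick the first candidate."""
    return 0


# Invented toy item illustrating a change-of-state alternation;
# it is NOT drawn from the datasets described above.
toy = BLMProblem(
    context=(
        "The chef melted the butter.",
        "The butter melted.",
        "The child broke the vase.",
    ),
    candidates=(
        "The vase broke.",            # completes the alternation pattern
        "The vase broke the child.",  # distractor: arguments swapped
        "The child broke.",           # distractor: wrong argument kept
    ),
    answer_index=0,
)

print(accuracy([toy], first_candidate_baseline))
```

In a real evaluation, `choose` would wrap a model that scores each candidate given the context; the multiple-choice framing is what lets accuracy be computed directly against `answer_index`.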

