Assessing Linguistic Generalisation in Language Models: A Dataset for Brazilian Portuguese

Source: arXiv. This is the original manuscript as submitted to Language Resources and Evaluation (LREV). The final published version is available at https://rdcu.be/ddEa6 (Language Resources and Evaluation, https://doi.org/10.1007/s10579-023-09664-1).

Much recent effort has been devoted to creating large-scale language models. The most prominent approaches today are based on deep neural networks, such as BERT. However, these models lack transparency and interpretability, and are often seen as black boxes. This affects not only their applicability in downstream tasks but also the comparability of different architectures, or even of the same model trained on different corpora or with different hyperparameters. In this paper, we propose a set of intrinsic evaluation tasks that inspect the linguistic information encoded in models developed for Brazilian Portuguese. These tasks are designed to evaluate how different language models generalise information related to grammatical structures and multiword expressions (MWEs), thus allowing for an assessment of whether a model has learned different linguistic phenomena. The dataset developed for these tasks consists of a series of sentences, each with a single masked word and a cue phrase that helps narrow down the context. The dataset is divided into MWEs and grammatical structures, and the latter is subdivided into six tasks: impersonal verbs, subject agreement, verb agreement, nominal agreement, passive voice and connectors. The MWE subset was used to test BERTimbau Large, BERTimbau Base and mBERT. For the grammatical structures, we used only BERTimbau Large, because it yielded the best results in the MWE task.
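
To make the evaluation setup more concrete, the following is a minimal Python sketch of how a masked-word item could be represented and scored. It is an illustration under stated assumptions, not the procedure used in the paper: the names `MaskedItem`, `hit_at_k`, `accuracy_at_k` and `toy_predict`, the `[MASK]` placeholder, the top-k hit criterion and both example sentences are invented here and are not taken from the dataset. The predictor is a canned stand-in for a real masked language model, so the sketch runs without downloading any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

MASK = "[MASK]"  # assumed placeholder for the single masked word


@dataclass
class MaskedItem:
    """One test sentence (hypothetical structure, not the paper's file format)."""
    task: str                 # e.g. "MWE" or "impersonal verbs"
    sentence: str             # must contain exactly one MASK placeholder
    expected: Sequence[str]   # fillers counted as correct


# A predictor maps a masked sentence to candidate fillers, best candidate first.
Predictor = Callable[[str], List[str]]


def hit_at_k(item: MaskedItem, predict: Predictor, k: int) -> bool:
    """Return True if any expected filler is among the top-k candidates."""
    if item.sentence.count(MASK) != 1:
        raise ValueError("each sentence must contain exactly one masked word")
    top_k = [c.lower() for c in predict(item.sentence)[:k]]
    return any(e.lower() in top_k for e in item.expected)


def accuracy_at_k(items: Sequence[MaskedItem], predict: Predictor, k: int) -> float:
    """Fraction of items with a correct filler in the top-k candidates."""
    if not items:
        return 0.0
    return sum(hit_at_k(item, predict, k) for item in items) / len(items)


# Invented example items, for illustration only.
ITEMS = [
    MaskedItem("impersonal verbs", "Ontem [MASK] muito na cidade.", ["choveu"]),
    MaskedItem("MWE", "As crianças adoraram a montanha [MASK] do parque.", ["russa"]),
]


def toy_predict(sentence: str) -> List[str]:
    """Canned candidates standing in for a real masked language model."""
    canned = {
        "Ontem [MASK] muito na cidade.": ["choveu", "trabalhei", "andei"],
        "As crianças adoraram a montanha [MASK] do parque.": ["alta", "russa", "nova"],
    }
    return canned.get(sentence, [])


if __name__ == "__main__":
    for k in (1, 2):
        print(f"accuracy@{k}:", accuracy_at_k(ITEMS, toy_predict, k))
```

In practice, `toy_predict` would be replaced by a function that queries a model such as BERTimbau for the most probable fillers of the masked position. Reporting accuracy per `task` label rather than over the pooled items would allow the six grammatical tasks and the MWE subset to be compared separately.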

