Much recent effort has been devoted to creating large-scale language models. The most prominent current approaches, such as BERT, are based on deep neural networks. However, these models lack transparency and interpretability and are often seen as black boxes. This affects not only their applicability in downstream tasks but also the comparability of different architectures, or even of the same model trained on different corpora or with different hyperparameters. In this paper, we propose a set of intrinsic evaluation tasks that inspect the linguistic information encoded in models developed for Brazilian Portuguese. These tasks are designed to evaluate how different language models generalise information related to grammatical structures and multiword expressions (MWEs), thus allowing for an assessment of whether a model has learned different linguistic phenomena. The dataset developed for these tasks consists of sentences that each have a single masked word and a cue phrase that helps narrow down the context. It is divided into MWEs and grammatical structures, and the latter is subdivided into six tasks: impersonal verbs, subject agreement, verb agreement, nominal agreement, passive and connectors. The MWE subset was used to test BERTimbau Large, BERTimbau Base and mBERT. For the grammatical structures, we used only BERTimbau Large, because it yielded the best results in the MWE task.
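As an illustration of the kind of masked-word probe described above, the following is a minimal sketch of how one item per sentence could be scored. It is not the scoring procedure used in the paper: the `MaskedItem` layout, the choice to prepend the cue phrase to the sentence, the top-k hit criterion, the two example items and the `toy_predict` stand-in are all assumptions made for the sake of the example.

```python
from typing import Callable, Iterable, List, NamedTuple, Set

MASK = "[MASK]"


class MaskedItem(NamedTuple):
    """One probe item (hypothetical layout, not the paper's file format)."""
    cue: str            # cue phrase that narrows down the context
    sentence: str       # sentence containing exactly one masked word
    accepted: Set[str]  # fillers counted as correct for this item


# A predictor maps a masked text to candidate fillers, best first.
Predictor = Callable[[str], List[str]]


def top_k_accuracy(items: Iterable[MaskedItem], predict: Predictor, k: int = 10) -> float:
    """Fraction of items with at least one accepted filler among the top k candidates."""
    items = list(items)
    if not items:
        return 0.0
    hits = 0
    for item in items:
        if item.sentence.count(MASK) != 1:
            raise ValueError(f"expected exactly one {MASK}: {item.sentence!r}")
        # Assumption: the cue phrase is simply prepended to the masked sentence.
        text = f"{item.cue} {item.sentence}"
        candidates = [c.strip().lower() for c in predict(text)[:k]]
        accepted = {a.lower() for a in item.accepted}
        if accepted.intersection(candidates):
            hits += 1
    return hits / len(items)


# Invented items for illustration only; they are not taken from the dataset.
ITEMS = [
    MaskedItem("Sobre o tempo:", "[MASK] muitos anos que não o vejo.", {"faz", "há"}),
    MaskedItem("Sobre as crianças:", "As meninas [MASK] felizes.", {"estão", "são"}),
]


def toy_predict(text: str) -> List[str]:
    """Stand-in for a real masked language model, with fixed answers."""
    if "anos" in text:
        return ["Faz", "há"]   # correct impersonal verb forms
    return ["está", "é"]       # singular forms: wrong agreement


if __name__ == "__main__":
    print(top_k_accuracy(ITEMS, toy_predict, k=2))
```

In practice, `toy_predict` would be replaced by a wrapper around an actual masked language model, for instance a fill-mask pipeline from a library such as Hugging Face Transformers, and accuracy would be reported separately for each task.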