Much recent effort has been devoted to creating large-scale language models. The most prominent current approaches are based on deep neural networks, such as BERT. However, these models lack transparency and interpretability, and are often seen as black boxes. This limits not only their applicability in downstream tasks but also the comparability of different architectures, or even of the same model trained on different corpora or with different hyperparameters. In this paper, we propose a set of intrinsic evaluation tasks that inspect the linguistic information encoded in models developed for Brazilian Portuguese. These tasks are designed to evaluate how different language models generalise information related to grammatical structures and multiword expressions (MWEs), allowing us to assess whether a model has learned these linguistic phenomena. The dataset developed for these tasks consists of sentences, each with a single masked word and a cue phrase that helps narrow down the context. The dataset is divided into MWEs and grammatical structures, and the latter is subdivided into six tasks: impersonal verbs, subject agreement, verb agreement, nominal agreement, passive, and connectors. The MWE subset was used to test BERTimbau Large, BERTimbau Base and mBERT. For the grammatical structures, we used only BERTimbau Large, because it yielded the best results in the MWE task.
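As an illustration of the evaluation setup described above, the following is a minimal sketch of how a masked-word item with a cue phrase might be represented and scored. Everything in it is an assumption made for illustration: the `ProbeItem` fields, the choice to prepend the cue to the masked sentence, the hit-within-top-k criterion, the two example sentences, and the `toy_predict` stand-in are hypothetical and are not taken from the dataset or the paper's scoring procedure. In a real run, `toy_predict` would be replaced by a masked language model that returns ranked candidates for the `[MASK]` position.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Sequence

# A predictor takes the model input text and k, and returns up to k
# candidate words for the masked position, best first.
Predictor = Callable[[str, int], List[str]]


@dataclass(frozen=True)
class ProbeItem:
    """One test item (hypothetical layout, not the dataset's actual schema)."""
    task: str                 # e.g. "mwe" or "impersonal_verbs"
    cue: str                  # cue phrase that narrows down the context
    masked_sentence: str      # sentence containing exactly one [MASK]
    accepted: FrozenSet[str]  # lowercase words counted as correct


def build_input(item: ProbeItem) -> str:
    # Assumption: the cue phrase is simply prepended to the masked sentence.
    return f"{item.cue} {item.masked_sentence}"


def is_hit(item: ProbeItem, predict: Predictor, k: int) -> bool:
    """True if any of the top-k candidates is an accepted word."""
    candidates = predict(build_input(item), k)[:k]
    return any(c.lower() in item.accepted for c in candidates)


def accuracy_at_k(items: Sequence[ProbeItem], predict: Predictor, k: int) -> Dict[str, float]:
    """Per-task fraction of items with an accepted word in the top k."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for item in items:
        totals[item.task] += 1
        if is_hit(item, predict, k):
            hits[item.task] += 1
    return {task: hits[task] / totals[task] for task in totals}


# Invented example items, for illustration only.
ITEMS = [
    ProbeItem("mwe", "Ele morreu.", "O velho bateu as [MASK].",
              frozenset({"botas"})),
    ProbeItem("impersonal_verbs", "Existiam muitas pessoas.",
              "[MASK] muitas pessoas na festa.", frozenset({"havia"})),
]


def toy_predict(text: str, k: int) -> List[str]:
    """Stand-in for a masked language model: returns canned rankings."""
    canned = {
        "O velho bateu as [MASK].": ["portas", "botas", "mãos"],
        "[MASK] muitas pessoas na festa.": ["Haviam", "Tinham", "Eram"],
    }
    for sentence, ranked in canned.items():
        if text.endswith(sentence):
            return ranked[:k]
    return []


if __name__ == "__main__":
    for k in (1, 2):
        print(k, accuracy_at_k(ITEMS, toy_predict, k))
```

Reporting the score per task rather than as a single number keeps the MWE results separate from each grammatical task, which matches the way the dataset is divided. The canned impersonal-verb ranking deliberately puts the plural form first to show how an agreement error would count as a miss.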