Definition modelling (DM) is the task of automatically generating a dictionary definition for a specific word. Computational systems capable of DM have numerous applications that benefit a wide range of audiences. Because DM is treated as a supervised natural language generation problem, these systems require large annotated datasets to train the underlying machine learning (ML) models. Several DM datasets have been released for English and other high-resource languages. Although Portuguese is considered a mid/high-resource language in most natural language processing tasks and is spoken by more than 200 million native speakers, no DM dataset is available for it. In this research, we fill this gap by introducing DORE, the first dataset for Definition MOdelling for PoRtuguEse, containing more than 100,000 definitions. We also evaluate several deep learning-based DM models on DORE and report the results. The dataset and the findings of this paper will facilitate research and study of Portuguese in wider contexts.