DORE: A Dataset For Portuguese Definition Generation

Definition modelling (DM) is the task of automatically generating a dictionary definition for a specific word. Computational systems capable of DM have numerous applications that benefit a wide range of audiences. Because DM is treated as a supervised natural language generation problem, these systems require large annotated datasets to train machine learning (ML) models. Several DM datasets have been released for English and other high-resource languages. Although Portuguese is considered a mid/high-resource language in most natural language processing tasks and is spoken by more than 200 million native speakers, no DM dataset is available for Portuguese. In this research, we fill this gap by introducing DORE, the first dataset for Definition MOdelling for PoRtuguEse, containing more than 100,000 definitions. We also evaluate several deep-learning-based DM models on DORE and report the results. The dataset and the findings of this paper will facilitate research and study of Portuguese in wider contexts.

