To reduce the manual cost of designing ER models, recent approaches address the task of NL2ERM, i.e., automatically generating entity-relationship (ER) models from natural language (NL) utterances such as software requirements. These approaches are typically rule-based: they rely on rigid heuristic rules and therefore generalize poorly to the many linguistic ways of describing the same requirement. Deep-learning-based models generalize better than rule-based approaches, but they have not yet been developed for NL2ERM because no large-scale dataset is available. To address this issue, we report our insight that the task of NL2ERM is highly similar to the increasingly popular task of text-to-SQL, and we propose a data transformation algorithm that converts existing text-to-SQL data into NL2ERM data. We apply this algorithm to Spider, one of the most popular text-to-SQL datasets, and we also collect data entries with different NL types, to obtain a large-scale NL2ERM dataset. Because NL2ERM can be seen as a special information extraction (IE) task, we train two state-of-the-art IE models on our dataset. The experimental results show that both models achieve high performance and outperform existing baselines.
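To make the similarity between text-to-SQL and NL2ERM concrete, the following minimal sketch shows one plausible way a relational schema of the kind shipped with text-to-SQL datasets could be mapped to an ER model. It is an illustration under our own assumptions, not the transformation algorithm proposed in the paper: the function name `schema_to_er`, the `ERModel` container, the input layout, and the mapping rules (table to entity, non-foreign-key column to attribute, foreign key to relationship) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ERModel:
    """Hypothetical container for an ER model (not the paper's format)."""
    # entity name -> list of attribute names
    entities: dict[str, list[str]] = field(default_factory=dict)
    # (source entity, target entity) pairs, one per linked table pair
    relationships: list[tuple[str, str]] = field(default_factory=list)


def schema_to_er(
    tables: dict[str, list[str]],
    foreign_keys: list[tuple[str, str, str, str]],
) -> ERModel:
    """Map a relational schema to an ER model using three assumed rules.

    tables:       {table_name: [column_name, ...]}
    foreign_keys: [(src_table, src_column, dst_table, dst_column), ...]
    """
    # Columns that only reference another table are treated as links,
    # not as attributes of the entity (an assumption of this sketch).
    fk_columns = {(table, column) for table, column, _, _ in foreign_keys}

    model = ERModel()
    for table, columns in tables.items():
        model.entities[table] = [
            c for c in columns if (table, c) not in fk_columns
        ]

    # Each foreign key becomes a relationship between the two tables;
    # duplicate pairs are kept only once.
    for src_table, _, dst_table, _ in foreign_keys:
        pair = (src_table, dst_table)
        if pair not in model.relationships:
            model.relationships.append(pair)
    return model


# Toy schema made up for illustration.
example = schema_to_er(
    tables={
        "singer": ["singer_id", "name"],
        "concert": ["concert_id", "singer_id", "venue"],
    },
    foreign_keys=[("concert", "singer_id", "singer", "singer_id")],
)
print(example.entities)
print(example.relationships)
```

In this toy schema, `concert.singer_id` is dropped from the attributes of `concert` and instead yields a relationship between `concert` and `singer`. A real transformation would also need to handle cases this sketch ignores, such as junction tables that encode many-to-many relationships.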