Converting natural-language text into Structured Query Language (Text2SQL) is an active research topic in natural language processing (NLP) with broad application prospects. In the era of big data, databases are used across all industries, and the data they hold is large in scale, diverse in type, and wide in scope. This makes querying cumbersome and inefficient and places higher demands on Text2SQL models. In practice, current mainstream end-to-end Text2SQL models are difficult to build because of their complex structure and heavy training-data requirements, and difficult to tune because of their large number of parameters. In addition, their accuracy often falls short of what is needed. To address these problems, this paper proposes a pipelined Text2SQL method, SPSQL. The method decomposes the Text2SQL task into four subtasks: table selection, column selection, SQL generation, and value filling. These are cast as a text classification problem, a sequence labeling problem, and two text generation problems, respectively. We then construct the data format for each subtask from existing data and improve the accuracy of the overall model by improving the accuracy of each submodel. We also use a named entity recognition module and data augmentation to further optimize the overall model. We construct the dataset from the marketing business data of the State Grid Corporation of China. Experiments show that the proposed method outperforms both the end-to-end method and other pipeline methods.
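The four-stage decomposition described in the abstract can be pictured with a minimal sketch. This is not the SPSQL implementation: every function name, the toy schema, and the example question are hypothetical, and each stage is a simple rule-based stand-in for what the paper describes as a learned submodel (a text classifier, a sequence labeler, and two text generators). The sketch only shows how the output of one stage feeds the next.

```python
import re
from typing import Dict, List

# Hypothetical schema format: table name -> list of column names.
Schema = Dict[str, List[str]]


def select_table(question: str, schema: Schema) -> str:
    """Stage 1 (text classification in the paper; keyword match here)."""
    words = question.lower().split()
    for table in schema:
        if table.lower() in words:
            return table
    # Fall back to the first table when no table name is mentioned.
    return next(iter(schema))


def select_columns(question: str, table: str, schema: Schema) -> List[str]:
    """Stage 2 (sequence labeling in the paper; keyword match here)."""
    words = question.lower().split()
    return [col for col in schema[table] if col.lower() in words]


def generate_sql(table: str, columns: List[str]) -> str:
    """Stage 3 (text generation in the paper; fixed template here).

    Toy rule: the first selected column is projected and the remaining
    ones become equality filters with [column] placeholders.
    """
    if not columns:
        return f"SELECT * FROM {table}"
    target, filters = columns[0], columns[1:]
    sql = f"SELECT {target} FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(f"{c} = [{c}]" for c in filters)
    return sql


def fill_values(question: str, sql: str) -> str:
    """Stage 4 (text generation in the paper; regex lookup here).

    Replaces each [column] placeholder with the word that follows
    "<column> is" in the question; unmatched placeholders are kept.
    """
    def lookup(match: "re.Match[str]") -> str:
        col = match.group(1)
        found = re.search(rf"\b{re.escape(col)} is (\w+)", question, re.IGNORECASE)
        return f"'{found.group(1)}'" if found else match.group(0)

    return re.sub(r"\[(\w+)\]", lookup, sql)


def text_to_sql(question: str, schema: Schema) -> str:
    """Run the four stages in order, passing each result to the next."""
    table = select_table(question, schema)
    columns = select_columns(question, table, schema)
    skeleton = generate_sql(table, columns)
    return fill_values(question, skeleton)


if __name__ == "__main__":
    toy_schema = {"users": ["name", "city"], "bills": ["amount", "region"]}
    print(text_to_sql("show amount from bills where region is north", toy_schema))
```

Because each stage has its own narrow input and output, any one of them can be replaced or retrained without touching the others, which is the property the abstract relies on when it improves overall accuracy by improving each submodel.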