Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries over a given relational database. It has traditionally been implemented in a cascaded manner and faces the following challenges: 1) model training suffers from data scarcity, since only limited parallel data is available; and 2) systems must be robust enough to handle diverse out-of-domain speech samples that differ from the source data. In this work, we propose Wav2SQL, the first direct speech-to-SQL parsing model, which avoids error compounding across cascaded systems. Specifically, 1) to accelerate speech-driven SQL parsing research in the community, we release MASpider, a large-scale multi-speaker dataset; 2) leveraging recent progress in large-scale pre-training, we show that it alleviates the data scarcity issue and allows for direct speech-to-SQL parsing; and 3) we include speech re-programming and gradient reversal classifier techniques to reduce acoustic variance and learn style-agnostic representations, improving generalization to unseen out-of-domain custom data. Experimental results demonstrate that Wav2SQL avoids error compounding and achieves state-of-the-art results, with up to a 2.5% accuracy improvement over the baseline.
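The abstract names a gradient reversal classifier but does not describe its details. As a rough illustration of the general technique only (not the authors' implementation), the sketch below hand-codes a gradient reversal layer on a toy scalar model: the forward pass is the identity, while the backward pass negates and scales the gradient, so the classifier learns to predict a nuisance label (assumed here to be something like speaker style) while the encoder is pushed to remove that information. All names (`grl_forward`, `grl_backward`, `toy_gradients`, `lam`) and the squared-error toy loss are assumptions made for this example.

```python
# Toy gradient reversal layer (GRL) with hand-written backprop, no autograd library.
# Hypothetical names and model; this is not taken from the Wav2SQL code.

def grl_forward(x):
    """Forward pass of the GRL: identity."""
    return list(x)


def grl_backward(grad_output, lam=1.0):
    """Backward pass of the GRL: negate the incoming gradient and scale it by lam."""
    return [-lam * g for g in grad_output]


def toy_gradients(x, y, w_enc, w_cls, lam=1.0):
    """Gradients for a scalar encoder -> GRL -> classifier chain.

    Loss is L = 0.5 * (s - y)^2 with s = w_cls * GRL(w_enc * x).
    Returns (dL/dw_cls, reversed dL/dw_enc).
    """
    h = w_enc * x                    # encoder feature
    z = grl_forward([h])[0]          # unchanged in the forward pass
    s = w_cls * z                    # classifier prediction
    err = s - y                      # dL/ds

    grad_w_cls = err * z             # classifier gradient is unaffected by the GRL
    grad_z = err * w_cls             # gradient arriving at the GRL output
    grad_h = grl_backward([grad_z], lam)[0]  # sign flipped before reaching the encoder
    grad_w_enc = grad_h * x
    return grad_w_cls, grad_w_enc


if __name__ == "__main__":
    g_cls, g_enc = toy_gradients(x=2.0, y=1.0, w_enc=0.5, w_cls=3.0, lam=1.0)
    print(g_cls, g_enc)
```

With a gradient-descent update, the classifier weight moves to reduce the classification loss while the encoder weight moves to increase it, which is the adversarial effect that encourages style-agnostic features. In a real system this would typically be written as a custom autograd function in a deep learning framework rather than by hand.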