Spoken semantic parsing (SSP) generates machine-comprehensible parses from input speech. Training robust models for application domains already represented in the training data, or extending them to new domains, requires matched triplets of speech, transcript, and semantic parse, which are expensive to obtain. In this paper, we address this challenge by examining methods that can use transcript-semantic parse data (unpaired text) without corresponding speech. First, when unpaired text is drawn from existing textual corpora, we compare Joint Audio Text (JAT) and Text-to-Speech (TTS) as ways to generate speech representations for the unpaired text. Experiments on the STOP dataset show that unpaired text from existing and new domains improves Exact Match (EM) by 2% and 30% absolute, respectively. Second, we consider the setting in which unpaired text is not available in existing textual corpora. We propose prompting Large Language Models (LLMs) to generate unpaired text for existing and new domains. Experiments show that examples and words that co-occur with intents can be used to generate unpaired text with Llama 2.0. Using the generated text with JAT and TTS for spoken semantic parsing improves EM on STOP by 1.4% and 2.6% absolute for existing and new domains, respectively.
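Because every result above is reported in Exact Match (EM), a minimal sketch of how such a metric could be computed may help. It assumes that predicted and reference parses are plain strings and that differences in whitespace should not count as errors; the function names and the bracketed example parses are illustrative assumptions, not the authors' evaluation code.

```python
def normalize(parse: str) -> str:
    """Collapse runs of whitespace (assumed normalization, not from the paper)."""
    return " ".join(parse.split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of predictions identical to their reference parse."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


if __name__ == "__main__":
    # Hypothetical parses written in a bracketed intent/slot style for illustration only.
    refs = [
        "[IN:GET_WEATHER what is the weather in [SL:LOCATION boston ] ]",
        "[IN:CREATE_ALARM set an alarm for [SL:DATE_TIME six am ] ]",
    ]
    hyps = [
        "[IN:GET_WEATHER what is the weather in  [SL:LOCATION boston ] ]",
        "[IN:CREATE_ALARM set an alarm for [SL:DATE_TIME seven am ] ]",
    ]
    print(exact_match(hyps, refs))  # 50.0
```

Under this definition, an "absolute" improvement is the difference in percentage points between two such scores, not a relative change.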