End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore \textit{zero-shot} E2E SLU, which learns E2E SLU without speech-semantics pairs, using only speech-text and text-semantics pairs instead. Previous work achieved zero-shot learning by pseudo-labeling all speech-text transcripts with a natural language understanding (NLU) model trained on text-semantics corpora. However, this method requires the domains of the speech-text and text-semantics corpora to match, whereas they often mismatch because the corpora are collected separately. Furthermore, using the entire speech-text corpus from arbitrary domains leads to \textit{imbalance} and \textit{noise} issues. To address these issues, we propose \textit{cross-modal selective self-training} (CMSST). CMSST tackles imbalance by clustering in a joint space of the three modalities (speech, text, and semantics) and handles label noise with a selection network. We also introduce two benchmarks for zero-shot E2E SLU, covering matched and found speech (mismatched) settings. Experiments show that CMSST improves performance in both settings, with significantly reduced sample sizes and training time.