Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and Speech Translation (ST) performance. In this paper, we expand our understanding of the resulting shared speech-text representations with two types of analyses. First we examine the limits of speech-free domain adaptation, finding that a corpus-specific duration model for speech-text alignment is the most important component for learning a shared speech-text representation. Second, we inspect the similarities between activations of unimodal (speech or text) encoders as compared to the activations of a shared encoder. We find that the shared encoder learns a more compact and overlapping speech-text representation than the uni-modal encoders. We hypothesize that this partially explains the effectiveness of the Maestro shared speech-text representations.