The CTC compressor can be an effective approach for integrating audio encoders into decoder-only models, which have gained growing interest for various speech applications. In this work, we propose a novel CTC-compressor-based joint speech and text training (CJST) framework for decoder-only ASR. CJST matches the speech and text modalities from both directions by exploiting a simple modality adaptor and several features of the CTC compressor, including sequence compression, on-the-fly forced peaky alignment, and CTC class embeddings. Experimental results on the Librispeech and TED-LIUM2 corpora show that the proposed CJST achieves effective text injection without the need for duration handling, leading to the best performance in both in-domain and cross-domain scenarios. We also provide a comprehensive study of the CTC compressor, covering various compression modes, edge-case handling, and behavior under both clean and noisy data conditions, which reveals the most robust setting for using the CTC compressor with decoder-only models.
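To make the sequence-compression idea concrete, the following is a minimal sketch of one common CTC compression mode: frames whose argmax CTC label is blank are dropped, and consecutive frames sharing the same predicted label are averaged into a single vector. The function name, shapes, and the all-blank fallback are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def ctc_compress(frames, log_probs, blank=0):
    """Compress encoder frames using CTC argmax labels:
    drop blank frames, then average consecutive frames
    that share the same predicted (non-blank) label.

    frames:    (T, D) encoder outputs
    log_probs: (T, V) CTC output log-probabilities
    returns:   (T', D) compressed sequence, T' <= T
    """
    labels = log_probs.argmax(axis=-1)
    segments = []  # list of (label, [frame indices])
    for t, lab in enumerate(labels):
        if lab == blank:
            continue
        if segments and segments[-1][0] == lab:
            segments[-1][1].append(t)  # extend the current same-label run
        else:
            segments.append((lab, [t]))  # start a new run
    if not segments:
        # edge case assumed here: an all-blank prediction keeps one mean frame
        return frames.mean(axis=0, keepdims=True)
    return np.stack([frames[idx].mean(axis=0) for _, idx in segments])
```

With argmax labels `[blank, 1, 1, blank, 2, 2]`, six frames compress to two vectors, one per label run; the compressed length thus tracks the CTC label sequence rather than the acoustic frame rate, which is what allows the decoder-only model to consume a much shorter input.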