Self-supervised learning (SSL) models have shown exceptional capabilities across various speech-processing tasks. Continuous SSL representations are effective but incur high computational and storage costs. Discrete SSL representations, although they degrade performance, reduce transmission and storage costs and improve input-sequence efficiency through de-duplication and subword modeling. To boost the performance of discrete representations for ASR, we introduce a novel fusion mechanism that integrates two discrete representations. The fusion mechanism preserves all the benefits of discrete representations while enhancing the model's performance by integrating complementary information. Additionally, we explore "self-augmented" discrete representations, which apply transformations to a single continuous SSL representation, eliminating the fusion mechanism's dependency on multiple SSL models and further decreasing its inference cost. Experimental results on benchmarks including LibriSpeech and ML-SUPERB show relative character error rate improvements of up to 19% and 24%, respectively, over the non-fusion baseline, validating the effectiveness of the proposed methods.
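To make the fusion idea concrete, below is a minimal sketch of one plausible way to integrate two discrete token streams (e.g., k-means indices from two SSL models, or two "self-augmented" quantizations of one model): each stream is embedded separately, the embeddings are concatenated, and the result is projected to the ASR encoder dimension. The class name, vocabulary sizes, dimensions, and the fusion-by-concatenation choice are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DiscreteFusion(nn.Module):
    """Hypothetical fusion of two discrete SSL token streams.

    Each stream is embedded separately; the embeddings are concatenated
    and projected to the encoder dimension. Illustrative sketch only.
    """

    def __init__(self, vocab_a: int, vocab_b: int, emb_dim: int = 256, out_dim: int = 512):
        super().__init__()
        self.emb_a = nn.Embedding(vocab_a, emb_dim)
        self.emb_b = nn.Embedding(vocab_b, emb_dim)
        self.proj = nn.Linear(2 * emb_dim, out_dim)

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # tokens_a, tokens_b: (batch, time) integer indices, assumed time-aligned
        fused = torch.cat([self.emb_a(tokens_a), self.emb_b(tokens_b)], dim=-1)
        return self.proj(fused)  # (batch, time, out_dim), fed to the ASR encoder


# Usage: two aligned token streams for a 100-frame utterance
fusion = DiscreteFusion(vocab_a=500, vocab_b=1000)
a = torch.randint(0, 500, (1, 100))
b = torch.randint(0, 1000, (1, 100))
features = fusion(a, b)  # torch.Size([1, 100, 512])
```

In this sketch, the "self-augmented" setting would simply feed two different quantizations of the same continuous representation as `tokens_a` and `tokens_b`, avoiding a second SSL model at inference time.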