Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by recasting it as text denoising: the LLM is trained to recover clean transcripts from noisy text inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
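To make the denoising setup concrete, the sketch below shows one way to build (noisy, clean) training pairs from target-domain text. Everything here is illustrative: the `corrupt` function, the noise rates, and the word-level error model are assumptions for exposition, not the paper's actual corruption procedure.

```python
import random

def corrupt(words, sub_rate=0.1, del_rate=0.05, ins_rate=0.05, rng=None):
    """Inject word-level substitutions, deletions, and insertions into a
    clean transcript to mimic noisy ASR-like inputs (illustrative only)."""
    rng = rng or random.Random(0)
    vocab = list(set(words))
    out = []
    for w in words:
        r = rng.random()
        if r < del_rate:
            continue                       # deletion: drop the word
        elif r < del_rate + sub_rate:
            out.append(rng.choice(vocab))  # substitution: random vocab word
        else:
            out.append(w)                  # keep the word unchanged
        if rng.random() < ins_rate:
            out.append(rng.choice(vocab))  # insertion: spurious extra word
    return out

# Build one denoising example: the LLM sees the noisy text as input
# and is fine-tuned to reproduce the clean target-domain transcript.
clean = "the quick brown fox jumps over the lazy dog".split()
noisy = corrupt(clean)
example = {"input": " ".join(noisy), "target": " ".join(clean)}
```

Fine-tuning on such pairs exposes the LLM to target-domain vocabulary while keeping it in the "recover clean text from a corrupted representation" regime, which is the role it plays downstream of the audio projector.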