Training large foundation models with self-supervised objectives on unlabeled data, followed by fine-tuning on downstream tasks, has emerged as a standard procedure. Unfortunately, the efficacy of this approach is often constrained by both limited fine-tuning compute and the scarcity of labeled downstream data. We introduce Multimodal Attention Merging (MAM), an approach that transfers knowledge directly from the attention matrices of models trained on high-resource modalities (text and images) to models in resource-constrained domains (speech and audio) in a zero-shot manner. MAM reduces the Word Error Rate (WER) of an Automatic Speech Recognition (ASR) model by up to 6.70% relative and the classification error of an Audio Event Classification (AEC) model by 10.63% relative. When some data and compute are available, we present Learnable-MAM, a data-driven approach to merging attention matrices, which yields a further 2.90% relative reduction in WER for ASR and an 18.42% relative reduction in AEC classification error compared to fine-tuning.
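To make the idea of merging attention matrices concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the merge rule, so the sketch assumes a layer-wise convex interpolation controlled by a single scalar `lam`. The names `interpolate`, `merge_attention_layers`, and `relative_reduction`, as well as the `"query"` key and all numbers, are hypothetical and chosen for illustration only. In practice the weights would be tensors from two models with matching attention shapes.

```python
from typing import Dict, List

Matrix = List[List[float]]


def interpolate(target: Matrix, source: Matrix, lam: float) -> Matrix:
    """Return lam * target + (1 - lam) * source, element by element.

    ASSUMPTION: convex interpolation is used here as one plausible merge
    rule; the abstract does not specify the rule.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(target) != len(source) or any(
        len(t_row) != len(s_row) for t_row, s_row in zip(target, source)
    ):
        raise ValueError("attention matrices must have the same shape")
    return [
        [lam * t + (1.0 - lam) * s for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(target, source)
    ]


def merge_attention_layers(
    target_layers: List[Dict[str, Matrix]],
    source_layers: List[Dict[str, Matrix]],
    lam: float,
) -> List[Dict[str, Matrix]]:
    """Merge the named attention matrices of two models, layer by layer."""
    if len(target_layers) != len(source_layers):
        raise ValueError("models must have the same number of layers")
    return [
        {name: interpolate(t[name], s[name], lam) for name in t}
        for t, s in zip(target_layers, source_layers)
    ]


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Relative error reduction in percent, as reported for WER and AEC."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical one-layer "speech" and "text" models with 2x2 query matrices.
speech = [{"query": [[1.0, 0.0], [0.0, 1.0]]}]
text = [{"query": [[0.0, 1.0], [1.0, 0.0]]}]
merged = merge_attention_layers(speech, text, lam=0.75)
print(merged[0]["query"])  # [[0.75, 0.25], [0.25, 0.75]]

# Illustrative numbers only (not from the paper): 10.0% -> 9.0% error.
print(relative_reduction(10.0, 9.0))  # 10.0
```

Under this assumption, the zero-shot setting corresponds to choosing `lam` without gradient updates, while a learnable variant would treat `lam` (possibly one per layer) as a parameter fit on the available downstream data.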