Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables self-supervised adaptation with a BEST-RQ objective that dynamically adapts to target acoustic characteristics without manual tuning. Experiments on the MyST child speech corpus demonstrate efficiency and scalability: with 10 h of labeled data for fine-tuning, our method matches a fully supervised baseline trained on the complete 133 h labeled set. We establish new state-of-the-art word error rates (WERs) of 8.21% using Whisper-medium on MyST and 11.06% using Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction, highlighting the generalizability of our approach to diverse low-resource conditions.
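To illustrate what a hard Gumbel-Softmax selector over encoder layers does in its forward pass, the sketch below perturbs one logit per layer with Gumbel noise, applies a temperature-scaled softmax, and turns the result into a one-hot choice. It is a minimal illustration and not the authors' implementation. The function names (`hard_gumbel_softmax`, `select_layer`), the temperature `tau`, and the toy feature vectors standing in for Whisper encoder layer outputs are all assumptions. A trainable version would also need a straight-through gradient through the soft probabilities, which autograd frameworks provide (for example `torch.nn.functional.gumbel_softmax` with `hard=True`) and which is omitted here.

```python
import math
import random


def sample_gumbel(rng):
    # Gumbel(0, 1) noise via inverse transform: -log(-log(U)), U ~ Uniform(0, 1).
    # U is clamped away from 0 and 1 to keep both logarithms finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hard_gumbel_softmax(logits, tau=1.0, rng=None):
    # Returns (hard, soft): a one-hot selection and the relaxed probabilities.
    # `tau` is a hypothetical temperature; lower values make `soft` sharper.
    rng = rng or random.Random()
    perturbed = [(logit + sample_gumbel(rng)) / tau for logit in logits]
    soft = softmax(perturbed)
    index = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == index else 0.0 for i in range(len(soft))]
    return hard, soft


def select_layer(layer_outputs, logits, tau=1.0, rng=None):
    # Combines per-layer feature vectors with the one-hot weights, so the
    # result equals the output of exactly one layer.
    hard, soft = hard_gumbel_softmax(logits, tau, rng)
    dim = len(layer_outputs[0])
    selected = [
        sum(h * layer[d] for h, layer in zip(hard, layer_outputs))
        for d in range(dim)
    ]
    return selected, hard, soft


# Toy usage: four "layers", each a 3-dimensional feature vector (made-up values).
layer_outputs = [
    [0.1, 0.2, 0.3],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]
logits = [0.0, 0.0, 50.0, 0.0]  # assumed selector logits, strongly favoring layer 2
selected, hard, soft = select_layer(layer_outputs, logits, tau=1.0, rng=random.Random(0))
```

In this sketch the logits are fixed by hand. In an end-to-end setup they would be learned parameters updated by the adaptation loss, so the chosen layer can change as training proceeds.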