We introduce Condition-Aware Self-Supervised Learning Representation (CA-SSLR), a generalist conditioning model broadly applicable to various speech-processing tasks. Unlike standard fine-tuning methods that optimize solely for downstream models, CA-SSLR integrates language and speaker embeddings from earlier layers, making the SSL model aware of the current language and speaker context. This approach reduces the reliance on the input audio features while preserving the integrity of the base SSLR. CA-SSLR improves the model's capabilities and demonstrates its generality on unseen tasks with minimal task-specific tuning. Our method employs linear modulation to dynamically adjust internal representations, enabling fine-grained adaptability without significantly altering the original model behavior. Experiments show that CA-SSLR reduces the number of trainable parameters, mitigates overfitting, and excels in under-resourced and unseen tasks. Specifically, CA-SSLR achieves a 10% relative reduction in language identification (LID) errors and a 37% relative improvement in automatic speech recognition (ASR) character error rate (CER) on the ML-SUPERB benchmark, and a 27% relative decrease in speaker verification (SV) equal error rate (EER) on VoxCeleb-1, demonstrating its effectiveness.
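The linear modulation described above can be read as FiLM-style conditioning, where per-channel scale and shift parameters are predicted from a conditioning embedding. The sketch below illustrates that idea only; the function and weight names are hypothetical and not taken from the authors' implementation. Note the zero-initialized modulation weights, which make the conditioned layer start as an identity transform and thus leave the base model's behavior unchanged at the start of training:

```python
import numpy as np

def film_modulate(hidden, cond, w_gamma, w_beta):
    """Linearly modulate hidden features with a per-channel scale (gamma)
    and shift (beta) predicted from a conditioning embedding (FiLM-style).
    hidden: (batch, time, dim); cond: (batch, cond_dim)."""
    gamma = cond @ w_gamma  # (batch, dim): predicted per-channel scale
    beta = cond @ w_beta    # (batch, dim): predicted per-channel shift
    # Broadcast the scale/shift over the time axis of the representation.
    return (1.0 + gamma)[:, None, :] * hidden + beta[:, None, :]

# Toy shapes: batch=2, time=5, hidden dim=8, conditioning dim=4.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, 5, 8))
cond = rng.standard_normal((2, 4))

# Zero initialization: gamma = beta = 0, so modulation is the identity
# and the original SSL representations pass through untouched.
w_gamma = np.zeros((4, 8))
w_beta = np.zeros((4, 8))

out = film_modulate(hidden, cond, w_gamma, w_beta)
print(np.allclose(out, hidden))  # prints True: identity at initialization
```

As the modulation weights are trained, the scale and shift become functions of the language or speaker embedding, adapting the internal representations without rewriting the frozen base weights.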