In this paper, we study Test-Time Training (TTT) as a way to handle distribution shifts in speech applications. In particular, we introduce distribution shifts into the test datasets of standard speech-classification tasks, such as speaker identification and emotion detection, and explore how TTT can help the model adapt to them. In experiments covering distribution shifts due to background noise and to natural variations in speech such as gender and age, we identify key challenges with TTT, including sensitivity to optimization hyperparameters (e.g., the number of optimization steps and the subset of parameters chosen for TTT) and poor scalability (because each test example gets its own set of parameters). Finally, we propose BitFit, a parameter-efficient fine-tuning algorithm originally proposed for text applications that fine-tunes only the bias parameters, as a solution to these challenges, and we demonstrate that it is consistently more stable than fine-tuning all the parameters of the model.
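To make the bias-only idea concrete, the following is a minimal sketch of per-example test-time adaptation in which only the bias vector is updated. It is not the implementation used in this work: the one-layer toy model, the masked-reconstruction objective, the parameter names `weight` and `bias`, and the helper names `select_bias_params`, `reconstruct`, `mse`, and `ttt_bias_only` are all assumptions chosen for illustration.

```python
def select_bias_params(params):
    """BitFit-style selection: keep only parameters whose name ends in 'bias'."""
    return [name for name in params if name.endswith("bias")]


def reconstruct(params, x_masked):
    """Toy one-layer model (an assumption): x_hat = W @ x_masked + b."""
    return [
        sum(w * v for w, v in zip(row, x_masked)) + b_i
        for row, b_i in zip(params["weight"], params["bias"])
    ]


def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def ttt_bias_only(params, x, mask, steps=5, lr=0.1):
    """Adapt a per-example copy of the bias on a masked-reconstruction loss.

    The weight matrix is shared by reference, so the only per-example
    storage is the copied bias vector. The shared `params` are not modified.
    """
    adapted = dict(params)                  # shallow copy: weight stays shared
    adapted["bias"] = list(params["bias"])  # fresh bias copy for this example
    x_masked = [v * m for v, m in zip(x, mask)]
    n = len(x)
    for _ in range(steps):
        pred = reconstruct(adapted, x_masked)
        # Gradient of the MSE loss with respect to b_i: 2 * (pred_i - x_i) / n
        grad_b = [2.0 * (p - t) / n for p, t in zip(pred, x)]
        adapted["bias"] = [b - lr * g for b, g in zip(adapted["bias"], grad_b)]
    return adapted


# Hypothetical usage: a 2-dimensional "utterance" with its second feature masked.
shared = {"weight": [[1.0, 0.0], [0.0, 1.0]], "bias": [0.0, 0.0]}
example = [1.0, 2.0]
mask = [1, 0]
adapted = ttt_bias_only(shared, example, mask)
```

Because every test example starts again from the shared parameters, the per-example cost in this sketch is one bias vector rather than a full copy of the model, which is the scalability argument for restricting TTT to bias terms.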