Commonly used features in spoken language identification (LID), such as mel-spectrograms or MFCCs, lose high-frequency information due to windowing, and the loss grows with longer temporal contexts. To improve the generalization of low-resource LID systems, we investigate an alternative feature representation, the wavelet scattering transform (WST), which compensates for these shortcomings. To our knowledge, WST has not previously been explored for LID tasks. We first optimize WST features for multiple South Asian LID corpora. We show that LID requires low octave resolution and that frequency scattering is not useful. Further, cross-corpora evaluations show that the optimal WST hyper-parameters depend on both the training and test corpora. Hence, we develop fused ECAPA-TDNN-based LID systems with different sets of WST hyper-parameters to improve generalization to unseen data. Compared to MFCC, the equal error rate (EER) is reduced by up to 14.05% and 6.40% for same-corpora and blind VoxLingua107 evaluations, respectively.
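Since the reported gains are stated in terms of EER, a minimal sketch of how an equal error rate can be approximated from detection scores may help. This is an illustration only, not the authors' evaluation code: the function name `equal_error_rate`, the threshold-sweep approach, and the example scores are all assumptions introduced here.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: scores of trials where the hypothesized language is correct.
    nontarget_scores: scores of trials where it is incorrect.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical scores, chosen only to make the example easy to follow.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))  # 0.25
```

With a finite set of trials the two error rates rarely cross exactly, so this sketch reports the midpoint at the closest operating point; evaluation toolkits often interpolate instead, which can give slightly different values.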