This paper introduces a modeling approach that employs multi-level global processing spanning two feature scales: short-term frame-level and long-term sample-level. In the shallow feature-extraction stage, multi-level features are extracted at multiple scales, including Mel-Frequency Cepstral Coefficients (MFCC) and the log-energy spectrum computed before the Mel filterbank (Fbank) stage. The recognition network processes the input two-dimensional temporal features at both the frame and sample levels. Specifically, the model first applies a Convolutional Long Short-Term Memory (ConvLSTM) network based on one-dimensional convolution to fuse spatiotemporal information and extract short-term frame-level features. A Bidirectional Long Short-Term Memory (BiLSTM) network then learns long-term sample-level sequential representations. A Transformer encoder subsequently performs cross-scale, multi-level processing on the global frame-level and sample-level features, enabling deep feature representation and fusion at both levels. Finally, recognition results are obtained through a Softmax layer. Our method achieves 99.6% recognition accuracy on the CCNU_Mobile dataset, an improvement of 2% to 12% over the baseline systems. We also investigate the transferability of our model, achieving 87.9% accuracy on a classification task over a new dataset.
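To make the ConvLSTM-BiLSTM-Transformer pipeline concrete, the following is a minimal PyTorch sketch of the described architecture, not the paper's implementation. All hyperparameters are assumptions for illustration: the 39-dim input features, hidden width 64, 45 output classes, the mean-pooling over the spectral axis, and the concatenation of frame-level and sample-level sequences before the Transformer encoder are not specified in the abstract.

```python
# A minimal sketch of the multi-level pipeline; sizes and fusion details
# are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class ConvLSTM1dCell(nn.Module):
    """LSTM cell whose gate transforms are 1-D convolutions over the
    spectral (feature) axis, fusing spatial and temporal information."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four gates (i, f, g, o) at once.
        self.gates = nn.Conv1d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):                      # x: (B, in_ch, F)
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], 1)), 4, 1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class MultiLevelNet(nn.Module):
    def __init__(self, feat_dim=39, hid=64, n_classes=45):  # assumed sizes
        super().__init__()
        self.cell = ConvLSTM1dCell(1, hid)
        self.proj = nn.Linear(hid, 2 * hid)  # match BiLSTM output width
        self.bilstm = nn.LSTM(hid, hid, batch_first=True, bidirectional=True)
        enc_layer = nn.TransformerEncoderLayer(d_model=2 * hid, nhead=4,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(2 * hid, n_classes)

    def forward(self, x):                             # x: (B, T, feat_dim)
        B, T, F = x.shape
        h = x.new_zeros(B, self.cell.hid_ch, F)
        c = torch.zeros_like(h)
        frames = []
        for t in range(T):                            # short-term frame-level pass
            h, c = self.cell(x[:, t].unsqueeze(1), (h, c))
            frames.append(h.mean(dim=2))              # pool spectral axis -> (B, hid)
        frame_feats = torch.stack(frames, dim=1)      # (B, T, hid)
        sample_feats, _ = self.bilstm(frame_feats)    # long-term sample-level pass
        # Cross-scale fusion: self-attention over both feature levels jointly.
        fused = self.encoder(torch.cat([self.proj(frame_feats), sample_feats], 1))
        return self.head(fused.mean(dim=1))           # class logits

# Example: batch of 8 utterances, 200 frames of 39-dim features (assumed shape).
logits = MultiLevelNet()(torch.randn(8, 200, 39))
probs = logits.softmax(dim=-1)                        # recognition probabilities
```

One design note on this sketch: concatenating the projected frame-level sequence with the BiLSTM's sample-level sequence lets the encoder's self-attention operate across both scales in a single pass, which is one plausible reading of the cross-scale, multi-level processing the abstract describes.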