Environmental sound classification (ESC) has gained significant attention due to its diverse applications in smart city monitoring, fault detection, acoustic surveillance, and manufacturing quality control. Convolutional neural networks (CNNs) are a dominant approach to ESC, and feature stacking techniques have been explored to enhance their performance by aggregating complementary acoustic descriptors into richer input representations. In this paper, we investigate CNN-based models employing various stacked feature combinations, including Log-Mel Spectrogram (LM), Spectral Contrast (SPC), Chroma (CH), Tonnetz (TZ), Mel-Frequency Cepstral Coefficients (MFCCs), and Gammatone Cepstral Coefficients (GTCC). Experiments are conducted on the widely used ESC-50 and UrbanSound8K datasets under different training regimes, including pretraining on ESC-50, fine-tuning on UrbanSound8K, and comparison with Audio Spectrogram Transformer (AST) models pretrained on large-scale corpora such as AudioSet. This experimental design enables an analysis of how feature-stacked CNNs compare with transformer-based models under varying levels of training data and pretraining diversity. The results indicate that feature-stacked CNNs offer a more compute- and data-efficient alternative when large-scale pretraining or extensive training data are unavailable, making them particularly well suited for resource-constrained and edge-level sound classification scenarios.