In this paper, we investigate untrained recurrent models from the Reservoir Computing (RC) paradigm for audio surveillance, focusing on bidirectional Echo State Networks with different depths, from shallow to deep configurations, for emergency sound event detection. We evaluate these models on the MIVIA Audio Events dataset in a multiclass setting across different Signal-to-Noise Ratio (SNR) levels, with the goal of assessing the trade-off between depth, recognition performance, and computational efficiency. We compare the proposed architectures against fully trained recurrent and convolutional-recurrent baselines, namely Bidirectional Long Short-Term Memory networks (BiLSTMs) and Convolutional Recurrent Neural Networks (CRNNs). Results show that deep and shallow reservoir-based models achieve competitive recognition rates, with deeper variants being more robust in highly noisy conditions and shallower ones offering the most favorable efficiency profile, particularly on edge devices such as the NVIDIA Orin. In addition, the proposed approach remains robust across different input representations, including log-Mel spectrograms and MFCCs with varying resolutions. These findings highlight untrained reservoir architectures as a promising solution for resource-constrained audio surveillance scenarios.