Automatic audio event recognition plays a pivotal role in bringing human-robot interaction closer and has wide applicability in industrial automation, control, and surveillance systems. An audio event is composed of intricate phonic patterns that are harmonically entangled. Audio recognition is dominated by low- and mid-level features, which have demonstrated their recognition capability but have high computational cost and carry little semantic meaning. In this paper, we propose a new computationally efficient framework for audio recognition. Audio Bank, a new high-level representation of audio, is composed of distinctive audio detectors representing each audio class in frequency-temporal space. The dimensionality of the resulting feature vector is reduced using non-negative matrix factorization, preserving its discriminability and rich semantic information. High recognition performance with several classifiers (SVM, neural network, Gaussian process classification, and k-nearest neighbors) shows the effectiveness of the proposed method.
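The reduction and classification stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the Audio Bank detector responses are computed, so `make_toy_responses` generates hypothetical non-negative response vectors for two classes. The NMF solver here uses Lee-Seung multiplicative updates as an assumed choice, and k-nearest neighbors stands in for the classifiers listed. All function names and parameter values are illustrative.

```python
import numpy as np


def make_toy_responses(n_per_class=30, n_detectors=40, seed=0):
    """Hypothetical stand-in for Audio Bank features: each class excites
    a different half of the detectors, plus small non-negative noise."""
    rng = np.random.default_rng(seed)
    half = n_detectors // 2
    X0 = rng.random((n_per_class, n_detectors)) * 0.2
    X0[:, :half] += 1.0
    X1 = rng.random((n_per_class, n_detectors)) * 0.2
    X1[:, half:] += 1.0
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor non-negative V (samples x features) as W @ H using
    multiplicative updates for the squared-error objective."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def project(V, H, n_iter=200, eps=1e-9, seed=0):
    """Encode new samples against a fixed basis H (only W is updated)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], H.shape[0])) + eps
    for _ in range(n_iter):
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W


def knn_predict(train_X, train_y, test_X, k=3):
    """Majority vote among the k nearest training points (Euclidean)."""
    dists = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(votes).argmax() for votes in train_y[nearest]])


def run_demo(rank=2):
    """Reduce toy responses with NMF, then classify the held-out half."""
    X, y = make_toy_responses()
    X_train, y_train = X[::2], y[::2]    # even rows for training
    X_test, y_test = X[1::2], y[1::2]    # odd rows for testing
    W_train, H = nmf(X_train, rank)      # learn basis and training codes
    W_test = project(X_test, H)          # codes for unseen samples
    accuracy = float(np.mean(knn_predict(W_train, y_train, W_test) == y_test))
    return accuracy, W_train, H


if __name__ == "__main__":
    acc, W_train, H = run_demo()
    print("reduced shape:", W_train.shape, "accuracy:", acc)
```

The rows of `W_train` are the reduced feature vectors passed to the classifier. Because both factors stay non-negative, each reduced coordinate can be read as the strength of one additive component of the detector responses, which is one reason NMF is a natural fit when the semantic content of the features should be preserved.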