Audio fingerprinting is a well-established solution for song identification from short recording excerpts. Popular methods rely on the extraction of sparse representations, generally spectral peaks, and have proven to be accurate, fast, and scalable to large collections. However, real-world applications of audio identification often happen in noisy environments, which can cause these systems to fail. In this work, we tackle this problem by introducing and releasing a new audio augmentation pipeline that adds noise to music snippets in a realistic way, by stochastically mimicking real-world scenarios. We then propose and release a deep learning model that removes noisy components from spectrograms in order to improve peak-based fingerprinting systems' accuracy. We show that the addition of our model improves the identification performance of commonly used audio fingerprinting systems, even under noisy conditions.
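To make the idea of noise augmentation concrete, the sketch below mixes a background noise recording into a music snippet at a randomly drawn signal-to-noise ratio (SNR). It is not the released pipeline: the function names `mix_at_snr` and `random_snr_mix`, the uniform SNR range, and the purely additive noise model are assumptions made for illustration, and a realistic pipeline would likely combine several such degradations.

```python
import numpy as np


def mix_at_snr(music, noise, snr_db):
    """Add `noise` to `music`, scaled so the mix has the requested SNR in dB.

    Hypothetical helper for illustration; not part of the released pipeline.
    """
    music = np.asarray(music, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    if len(noise) == 0:
        raise ValueError("noise must contain at least one sample")

    # Loop the noise if it is too short, then trim it to the music length.
    reps = int(np.ceil(len(music) / len(noise)))
    noise = np.tile(noise, reps)[: len(music)]

    p_music = np.mean(music ** 2)
    p_noise = np.mean(noise ** 2)
    if p_noise == 0:
        # Silent noise: nothing to add.
        return music.copy()

    # Choose gain g so that p_music / (g^2 * p_noise) = 10^(snr_db / 10).
    gain = np.sqrt(p_music / (p_noise * 10 ** (snr_db / 10)))
    return music + gain * noise


def random_snr_mix(music, noise, rng, snr_range=(0.0, 10.0)):
    """Draw an SNR uniformly from `snr_range` (assumed values) and mix."""
    snr_db = rng.uniform(*snr_range)
    return mix_at_snr(music, noise, snr_db), snr_db


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(8000) / 8000.0
    music = np.sin(2 * np.pi * 440.0 * t)   # stand-in for a music snippet
    noise = rng.standard_normal(3000)       # stand-in for a noise recording
    noisy, snr_db = random_snr_mix(music, noise, rng)
    print(f"mixed {len(noisy)} samples at {snr_db:.2f} dB SNR")
```

Drawing the SNR at random for each snippet is one simple way to expose a denoising model to a range of noise levels rather than a single fixed condition.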