This paper proposes an audio fingerprinting model based on holographic reduced representation (HRR). Conventional neural audio fingerprinting requires many fingerprints per audio track to achieve high accuracy and time resolution; the proposed method reduces the number of fingerprints that must be stored. We use HRR to aggregate multiple fingerprints into a single composite fingerprint via circular convolution and summation, yielding fewer fingerprints with the same dimensionality as the originals. Our search method efficiently finds the composite fingerprint that contains a given query fingerprint. Using HRR's inverse operation, it then recovers the query's relative position within that composite fingerprint, retaining the original time resolution. Experiments show that our method reduces the number of fingerprints with modest accuracy degradation while maintaining time resolution, outperforming simple decimation and summation-based aggregation.
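The aggregation and position-recovery steps can be illustrated with a minimal NumPy sketch. It is not the paper's model: the random unit vectors standing in for fingerprints, the random `position_keys`, the dimensionality `dim`, the group size `group`, and the helper names are all assumptions made for illustration, and circular correlation is used as the standard approximate inverse of HRR binding.

```python
import numpy as np


def circ_conv(a, b):
    """Circular convolution (HRR binding), computed in the Fourier domain."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=a.shape[-1])


def circ_corr(a, b):
    """Circular correlation: approximately unbinds `a` from `b`."""
    return np.fft.irfft(np.conj(np.fft.rfft(a)) * np.fft.rfft(b), n=a.shape[-1])


def unit(x):
    """Scale vectors along the last axis to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


rng = np.random.default_rng(0)
dim, group = 1024, 4  # assumed values, not taken from the paper

# Hypothetical stand-ins: one random key per relative position, and
# random unit vectors in place of real neural fingerprints.
position_keys = unit(rng.standard_normal((group, dim)))
fingerprints = unit(rng.standard_normal((group, dim)))

# Bind each fingerprint to its position key, then sum into one composite
# vector that has the same dimensionality as a single fingerprint.
composite = circ_conv(position_keys, fingerprints).sum(axis=0)


def recover_position(query, composite, position_keys):
    """Unbind the query from the composite and pick the closest position key."""
    key_estimate = circ_corr(query, composite)
    return int(np.argmax(position_keys @ key_estimate))


recovered = [recover_position(f, composite, position_keys) for f in fingerprints]
print(recovered)
```

Unbinding a stored fingerprint from the composite returns a noisy copy of the key it was bound to, so the dot product with the correct key is close to 1 while the others stay close to 0 when `dim` is large relative to `group`. Real fingerprints from adjacent segments are correlated, so the margin in practice would be smaller than with these random vectors.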