Audio fingerprinting produces an identifiable representation of acoustic signals that can later be used by identification and retrieval systems. To obtain a discriminative representation, the input audio is usually segmented into shorter time intervals so that local acoustic features can be extracted and analyzed. Modern neural approaches typically operate on short, fixed-duration audio segments, yet the segment duration is often chosen heuristically and rarely examined in depth. In this paper, we study how segment length affects audio fingerprinting performance. We extend an existing neural fingerprinting architecture to accommodate various segment lengths and evaluate retrieval accuracy across different segment lengths and query durations. Our results show that short segments (0.5 seconds) generally achieve better performance. Moreover, we evaluate the capacity of large language models (LLMs) to recommend the best segment length; among the three LLMs studied, GPT-5-mini consistently gives the best suggestions across five considerations. Our findings provide practical guidance for selecting segment duration in large-scale neural audio retrieval systems.
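To make the segmentation step concrete, the following is a minimal sketch of cutting a mono waveform into fixed-duration segments before fingerprint extraction. It is an illustration under stated assumptions, not the architecture studied in the paper: the function name `segment_audio`, the `hop_seconds` parameter, and the 8 kHz sample rate in the usage note are hypothetical choices that the abstract does not specify.

```python
import numpy as np


def segment_audio(waveform, sample_rate, segment_seconds, hop_seconds=None):
    """Split a 1-D waveform into fixed-length segments.

    Hypothetical helper for illustration only. Returns an array of shape
    (n_segments, segment_len). Trailing samples that do not fill a whole
    segment are dropped.
    """
    # Assumption: with no hop given, segments do not overlap.
    if hop_seconds is None:
        hop_seconds = segment_seconds

    seg_len = int(round(segment_seconds * sample_rate))
    hop_len = int(round(hop_seconds * sample_rate))
    if seg_len <= 0 or hop_len <= 0:
        raise ValueError("segment and hop must each cover at least one sample")

    waveform = np.asarray(waveform)
    if waveform.ndim != 1:
        raise ValueError("expected a mono (1-D) waveform")

    # Audio shorter than one segment yields no segments.
    if waveform.shape[0] < seg_len:
        return np.empty((0, seg_len), dtype=waveform.dtype)

    n_segments = 1 + (waveform.shape[0] - seg_len) // hop_len
    starts = np.arange(n_segments) * hop_len
    # Build an index grid: row i selects samples starts[i] .. starts[i] + seg_len - 1.
    index = starts[:, None] + np.arange(seg_len)[None, :]
    return waveform[index]
```

Under these assumptions, a 5-second query sampled at 8 kHz yields 10 non-overlapping 0.5-second segments, whereas 1-second segments with a 0.5-second hop yield 9 overlapping ones. Shorter segments therefore produce more fingerprints per query, which is one reason the segment duration interacts with query duration in retrieval.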