Audio-based lyrics matching can be an appealing alternative to other content-based retrieval approaches, but existing methods often suffer from limited reproducibility and inconsistent baselines. In this work, we introduce WEALY, a fully reproducible pipeline that leverages Whisper decoder embeddings for lyrics matching tasks. WEALY establishes robust and transparent baselines and also explores multimodal extensions that integrate textual and acoustic features. Through extensive experiments on standard datasets, we demonstrate that WEALY achieves performance comparable to that of state-of-the-art methods that lack reproducibility. In addition, we provide ablation studies and analyses of language robustness, loss functions, and embedding strategies. This work contributes a reliable benchmark for future research and underscores the potential of speech technologies for music information retrieval tasks.