Stem retrieval, the task of matching missing stems to a given audio submix, remains a key challenge, and current models are limited because they discard temporal information. We introduce PHALAR, a contrastive framework that achieves a relative accuracy increase of up to $\approx 70\%$ over the state of the art while requiring $<50\%$ of the parameters and training $7\times$ faster. Using a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes a new state of the art in retrieval on MoisesDB, Slakh, and ChocoChorales, and it correlates significantly more strongly with human coherence judgments than semantic baselines do. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structure beyond the retrieval task.