In recent years, research on music transcription has focused mainly on architecture design and instrument-specific data acquisition. Because diverse datasets are scarce, progress is often limited to solo-instrument tasks such as piano transcription. Several works have explored multi-instrument transcription as a means of bolstering model performance on low-resource tasks, but these methods face the same data availability issues. We propose Timbre-Trap, a novel framework that unifies music transcription and audio reconstruction by exploiting the strong separability between pitch and timbre. We train a single autoencoder to simultaneously estimate pitch salience and reconstruct complex spectral coefficients, selecting between the two outputs during the decoding stage via a simple switch mechanism. In this way, the model learns to produce coefficients corresponding to timbre-less audio, which can be interpreted as pitch salience. We demonstrate that the framework achieves performance comparable to state-of-the-art instrument-agnostic transcription methods while requiring only a small amount of annotated data.
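The switch mechanism can be illustrated with a toy example. The sketch below is not the Timbre-Trap architecture: it is an untrained linear autoencoder, and the class name `SwitchedAutoencoder`, the `transcribe` flag, the layer sizes, and the choice to feed the switch to the decoder as one extra input are all assumptions made for illustration. It shows only the general idea of one shared encoder and one decoder whose output mode is selected by a binary switch, with the magnitude of the transcription-mode coefficients read as pitch salience.

```python
import random


class SwitchedAutoencoder:
    """Toy linear autoencoder whose decoder is conditioned on a binary switch.

    Hypothetical illustration only; weights are random and untrained.
    """

    def __init__(self, n_bins, n_latent, seed=0):
        rng = random.Random(seed)
        # Encoder: n_latent x n_bins real-valued weights.
        self.enc = [[rng.uniform(-0.1, 0.1) for _ in range(n_bins)]
                    for _ in range(n_latent)]
        # Decoder: n_bins x (n_latent + 1); the extra column weights the switch.
        self.dec = [[rng.uniform(-0.1, 0.1) for _ in range(n_latent + 1)]
                    for _ in range(n_bins)]

    def encode(self, frame):
        # Project one frame of complex spectral coefficients to a latent code.
        return [sum(w * x for w, x in zip(row, frame)) for row in self.enc]

    def decode(self, latent, transcribe):
        # Assumed wiring: the switch is appended to the latent code as 0.0 or 1.0.
        switch = 1.0 if transcribe else 0.0
        z = latent + [switch]
        return [sum(w * v for w, v in zip(row, z)) for row in self.dec]

    def forward(self, frame, transcribe):
        return self.decode(self.encode(frame), transcribe)


def salience(coeffs):
    # Interpret the magnitude of complex coefficients as pitch salience.
    return [abs(c) for c in coeffs]


model = SwitchedAutoencoder(n_bins=8, n_latent=3)
frame = [complex(0.1 * k, -0.05 * k) for k in range(8)]  # made-up input frame

recon = model.forward(frame, transcribe=False)  # reconstruction mode
trans = model.forward(frame, transcribe=True)   # transcription mode
sal = salience(trans)
```

Both modes share the same encoder and decoder weights, so in this toy version the two outputs differ only through the switch input. In a trained model, the two modes would be supervised with different targets (the input coefficients for reconstruction, pitch annotations for transcription).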