Existing multi-timbre transcription models generalize poorly beyond the instruments they were pre-trained on, impose rigid constraints on the number of sources, and carry high computational demands that hinder deployment on low-resource devices. We address these limitations with a lightweight model that extends a timbre-agnostic transcription backbone with a dedicated timbre encoder and performs deep clustering at the note level, enabling joint transcription and dynamic separation of arbitrary instruments given a specified number of instrument classes. Practical optimizations, including spectral normalization, dilated convolutions, and contrastive clustering, further improve efficiency and robustness. Despite its small size and fast inference, the model is competitive with heavier baselines in transcription accuracy and separation quality and shows promising generalization ability, making it well suited to real-world deployment in resource-constrained settings.
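To make the note-level clustering idea concrete, the following is a minimal sketch and not the model described above. It assumes each transcribed note already has a timbre embedding vector and groups the notes into a specified number of instrument classes. Plain spherical k-means stands in for the learned contrastive clustering, and the function name `cluster_notes`, its parameters, and the toy embeddings are all hypothetical.

```python
import numpy as np


def cluster_notes(embeddings: np.ndarray, num_instruments: int,
                  num_iters: int = 20) -> np.ndarray:
    """Assign each note to one of `num_instruments` clusters (spherical k-means).

    Hypothetical helper: a stand-in for learned note-level deep clustering.
    """
    # Unit-normalize so that the dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    x = embeddings / np.maximum(norms, 1e-12)

    # Deterministic farthest-point initialization: start from note 0, then
    # repeatedly add the note least similar to the centroids chosen so far.
    chosen = [0]
    for _ in range(1, num_instruments):
        sim_to_chosen = (x @ x[chosen].T).max(axis=1)
        chosen.append(int(sim_to_chosen.argmin()))
    centroids = x[chosen].copy()

    labels = np.zeros(len(x), dtype=int)
    for _ in range(num_iters):
        # Assignment step: most similar centroid by cosine similarity.
        labels = (x @ centroids.T).argmax(axis=1)
        # Update step: re-normalized mean of each cluster's members.
        for k in range(num_instruments):
            members = x[labels == k]
            if len(members) > 0:
                mean = members.mean(axis=0)
                centroids[k] = mean / max(np.linalg.norm(mean), 1e-12)
    return labels


# Toy example: six notes with made-up 2-D embeddings from two distinct timbres.
notes = np.array([
    [1.0, 0.1], [0.9, 0.0], [1.1, 0.2],   # hypothetical instrument A
    [0.1, 1.0], [0.0, 0.9], [0.2, 1.1],   # hypothetical instrument B
])
labels = cluster_notes(notes, num_instruments=2)
```

Because the number of classes is an argument rather than part of the architecture, the same embeddings can be regrouped for a different instrument count without retraining, which is the property the abstract refers to as dynamic separation.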