Existing multi-timbre transcription models generalize poorly beyond the instruments seen during training, impose rigid constraints on the number of sources, and demand computation that hinders deployment on low-resource devices. We address these limitations with a lightweight model that extends a timbre-agnostic transcription backbone with a dedicated timbre encoder and performs deep clustering at the note level, enabling joint transcription and dynamic separation of arbitrary instruments given a specified number of instrument classes. Practical optimizations, including spectral normalization, dilated convolutions, and contrastive clustering, further improve efficiency and robustness. Despite its small size and fast inference, the model performs competitively with heavier baselines in transcription accuracy and separation quality and shows promising generalization, making it well suited to real-world deployment in resource-constrained settings.
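To make the note-level clustering step concrete, the following is a minimal sketch of how notes could be grouped into a specified number of instrument classes once a timbre encoder has produced embeddings. It is an illustration under stated assumptions, not the model's actual procedure: the abstract does not say how embeddings are pooled or which clustering algorithm is used, so mean pooling per note and a spherical k-means are stand-ins, and the names `pool_note` and `cluster_notes` are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pool_note(frames):
    """Assumed pooling: average the frame embeddings of one note, then normalize."""
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return normalize(mean)


def cluster_notes(note_embs, k, n_iter=20, seed=0):
    """Assumed clustering: spherical k-means over unit-norm note embeddings.

    Returns one cluster index in range(k) per note; k is the user-specified
    number of instrument classes.
    """
    rng = random.Random(seed)
    # Farthest-point initialization: start from a random note, then repeatedly
    # add the note that is least similar to its closest existing center.
    centers = [note_embs[rng.randrange(len(note_embs))]]
    while len(centers) < k:
        far = min(note_embs, key=lambda e: max(dot(e, c) for c in centers))
        centers.append(far)

    def assign():
        # Each note goes to the center with the highest cosine similarity.
        return [max(range(k), key=lambda j: dot(e, centers[j])) for e in note_embs]

    for _ in range(n_iter):
        labels = assign()
        for j in range(k):
            members = [e for e, lab in zip(note_embs, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = normalize([sum(col) for col in zip(*members)])
    return assign()


# Toy usage: four notes whose frame embeddings come from two made-up timbres.
notes = [
    [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.0, 0.0]],
    [[0.95, 0.05, 0.0]],
    [[0.05, 0.9, 0.05], [0.0, 1.0, 0.0]],
]
embs = [pool_note(frames) for frames in notes]
labels = cluster_notes(embs, k=2)
```

The returned labels are arbitrary cluster indices rather than instrument names; in this sketch, the notes sharing an index would form one separated track, which is why no instrument needs to be known in advance beyond the class count `k`.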