Existing multi-timbre transcription models struggle to generalize beyond their pre-trained instruments, impose rigid constraints on the number of sources, and carry high computational costs that hinder deployment on low-resource devices. We address these limitations with a lightweight model that extends a timbre-agnostic transcription backbone with a dedicated timbre encoder and performs deep clustering at the note level, enabling joint transcription and dynamic separation of arbitrary instruments given a specified number of instrument classes. Practical optimizations, including spectral normalization, dilated convolutions, and contrastive clustering, further improve efficiency and robustness. Despite its small size and fast inference, the model achieves performance competitive with heavier baselines in transcription accuracy and separation quality, and shows promising generalization, making it well suited to deployment in practical, resource-constrained settings.
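To make the note-level grouping step concrete, the sketch below clusters note-level timbre embeddings into a user-specified number of instrument classes. It is a toy illustration only: the paper performs deep clustering on learned embeddings with a contrastive objective, whereas this stand-in uses a plain k-means over given vectors; the function names and the embedding data are hypothetical, not the authors' implementation.

```python
import math
import random

def dist(a, b):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_notes(embeddings, k, iters=20, seed=0):
    """Toy k-means stand-in for note-level deep clustering.

    `embeddings` holds one timbre vector per detected note event;
    `k` is the user-specified number of instrument classes.
    Returns a cluster label per note, i.e. a dynamic assignment of
    notes to instruments.  (Illustrative only; the actual model
    clusters learned embeddings with a contrastive objective.)
    """
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(embeddings, k)]
    labels = [0] * len(embeddings)
    for _ in range(iters):
        # Assignment step: attach each note to its nearest class center.
        for i, e in enumerate(embeddings):
            labels[i] = min(range(k), key=lambda c: dist(e, centers[c]))
        # Update step: move each center to the mean of its member notes.
        for c in range(k):
            members = [e for e, l in zip(embeddings, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

In the full model, the timbre encoder would produce these embeddings from the audio, so the same pipeline separates arbitrary instruments without retraining; only `k` changes per input.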