Virtual instrument generation requires consistent timbre across pitches and note velocities, a challenge that existing note-level models struggle to address. We present FlowSynth, which combines distributional flow matching (DFM) with test-time optimization for high-quality instrument synthesis. Unlike standard flow matching, which learns a deterministic mapping, DFM parameterizes the velocity field as a Gaussian distribution and is trained by minimizing negative log-likelihood, which lets the model express uncertainty in its predictions. This probabilistic formulation enables principled test-time search: we sample multiple trajectories weighted by model confidence and select the outputs that maximize timbre consistency. FlowSynth outperforms the current state-of-the-art baseline, TokenSynth, in both single-note quality and cross-note consistency. Our approach demonstrates that modeling predictive uncertainty in flow matching, combined with music-specific consistency objectives, provides an effective path to professional-quality virtual instruments suitable for real-time performance.
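To make the DFM idea concrete, the following is a minimal NumPy sketch of a Gaussian negative log-likelihood objective on a flow-matching velocity target, followed by a plain best-of-N selection step. It is an illustration under stated assumptions, not FlowSynth's implementation: the linear interpolation path, the diagonal covariance, the Euler sampler, and the names `toy_velocity_model` and `select_most_consistent` are all hypothetical, and the distance-to-reference score is only a placeholder for a timbre-consistency measure. The abstract does not specify how trajectories are weighted by model confidence, so that part is not reproduced here.

```python
import numpy as np


def flow_matching_pair(x0, x1, t):
    """Point on a linear path from noise x0 to data x1, plus its target velocity.

    The linear path is an assumption; the abstract does not name the path used.
    """
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0


def gaussian_nll(u_target, mu, log_var):
    """NLL of the target velocity under a diagonal Gaussian N(mu, exp(log_var)).

    Sums over the last axis, so it returns one value per sample.
    """
    per_dim = 0.5 * (
        log_var + (u_target - mu) ** 2 / np.exp(log_var) + np.log(2.0 * np.pi)
    )
    return per_dim.sum(axis=-1)


def toy_velocity_model(x_t, t):
    """Hypothetical stand-in for a trained network; returns (mu, log_var).

    The time argument t is accepted for interface parity but unused here.
    """
    mu = -x_t                          # mean velocity pulls toward the origin
    log_var = np.full_like(x_t, -2.0)  # fixed predicted uncertainty
    return mu, log_var


def sample_trajectory(x0, rng, n_steps=8):
    """Euler integration where each step draws a velocity v ~ N(mu, sigma^2)."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        mu, log_var = toy_velocity_model(x, k * dt)
        v = mu + np.exp(0.5 * log_var) * rng.standard_normal(x.shape)
        x = x + dt * v
    return x


def select_most_consistent(candidates, reference):
    """Return the index of the candidate closest to a reference vector.

    Negative Euclidean distance is a placeholder consistency score.
    """
    scores = -np.linalg.norm(candidates - reference, axis=-1)
    return int(np.argmax(scores)), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Training-side quantity: NLL of the true velocity under the prediction.
    x0, x1 = rng.standard_normal(2), rng.standard_normal(2)
    x_t, u = flow_matching_pair(x0, x1, t=0.3)
    mu, log_var = toy_velocity_model(x_t, 0.3)
    print("NLL:", gaussian_nll(u, mu, log_var))

    # Test-time search: draw several stochastic trajectories, keep the best.
    candidates = np.stack([sample_trajectory(x0, rng) for _ in range(4)])
    best, scores = select_most_consistent(candidates, reference=np.zeros(2))
    print("selected candidate:", best)
```

Because the model outputs a variance as well as a mean, repeated sampling from the same starting noise yields different outputs, which is what gives a selection step something to choose between. With a deterministic velocity field, every trajectory from a given `x0` would be identical.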