We introduce the text-to-instrument task, which aims to generate sample-based musical instruments from textual prompts. To this end, we propose InstrumentGen, a model that extends a text-prompted generative audio framework to condition on instrument family, source type, pitch (across the 88-key range), velocity, and a joint text/audio embedding. We also present a differentiable loss function for evaluating the intra-instrument timbral consistency of sample-based instruments. Our results establish a foundational text-to-instrument baseline and extend research on automatic sample-based instrument generation.
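To make the conditioning signals concrete, the following minimal sketch enumerates the per-sample conditions that one sample-based instrument would need. It is an illustration under stated assumptions, not the InstrumentGen implementation. The names `SampleCondition` and `instrument_grid` are hypothetical, and the example values for family, source type, and velocity layers are invented for the demonstration. The sketch also assumes the 88 keys follow the standard piano layout (MIDI notes 21 to 108) and that velocity uses the MIDI range 0 to 127. The joint text/audio embedding is left out.

```python
from dataclasses import dataclass

A0_MIDI = 21    # MIDI note of the lowest key on a standard 88-key piano
NUM_KEYS = 88   # key indices 0..87 map to MIDI notes 21..108


@dataclass(frozen=True)
class SampleCondition:
    """Hypothetical container for the conditioning signals of one sample."""
    instrument_family: str
    source_type: str
    key_index: int   # position on the 88-key range, 0..87
    velocity: int    # assumed MIDI-style velocity, 0..127

    def __post_init__(self):
        if not 0 <= self.key_index < NUM_KEYS:
            raise ValueError(f"key_index out of range: {self.key_index}")
        if not 0 <= self.velocity <= 127:
            raise ValueError(f"velocity out of range: {self.velocity}")

    @property
    def midi_note(self) -> int:
        return A0_MIDI + self.key_index

    @property
    def frequency_hz(self) -> float:
        # Equal temperament with A4 (MIDI note 69) tuned to 440 Hz.
        return 440.0 * 2.0 ** ((self.midi_note - 69) / 12)


def instrument_grid(family, source, velocities):
    """List every (key, velocity) condition for one instrument."""
    return [
        SampleCondition(family, source, key, vel)
        for vel in velocities
        for key in range(NUM_KEYS)
    ]


# Example values below are illustrative only.
grid = instrument_grid("string", "acoustic", velocities=[40, 80, 120])
```

A single text prompt therefore corresponds to many generated samples, one per key and velocity layer, which is why consistency of timbre across the samples of one instrument becomes a quantity worth measuring.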