Generative models for synthesizing audio textures explicitly encode controllability by conditioning the model on labeled data. While datasets of audio textures can easily be recorded in the wild, labeling them semantically is expensive, time-consuming, and error-prone because human annotators are subjective. Thus, to control generation, user-defined perceptual factors of variation must be inferred automatically in the latent space of a generative model trained on unlabeled textures. In this paper, we propose an example-based framework for determining vectors that guide texture generation according to user-defined semantic attributes. By creating a few synthetic examples that indicate the presence or absence of a semantic attribute, we can infer guidance vectors in the latent space of a generative model and use them to control that attribute during generation. Our results show that the method finds perceptually relevant and deterministic guidance vectors for controllable generation of both discrete and continuous textures. Furthermore, we demonstrate the application of the method to other tasks, such as selective semantic attribute transfer.
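The abstract does not state how the guidance vectors are computed, so the following is only a minimal sketch of one common way such a direction could be estimated: the normalized difference between the mean latent codes of examples with and without the attribute. The function names `guidance_vector` and `apply_guidance`, the toy three-dimensional latents, and the assumption that the synthetic examples have already been mapped to latent codes are all hypothetical and not taken from the paper.

```python
def mean_latent(latents):
    """Element-wise mean of a list of equal-length latent codes."""
    dim = len(latents[0])
    return [sum(z[i] for z in latents) / len(latents) for i in range(dim)]


def guidance_vector(with_attr, without_attr):
    """Unit-length direction from the 'attribute absent' mean to the
    'attribute present' mean (an assumed difference-of-means estimate)."""
    mean_pos = mean_latent(with_attr)
    mean_neg = mean_latent(without_attr)
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("both example sets have the same mean latent code")
    return [d / norm for d in diff]


def apply_guidance(z, direction, strength):
    """Move a latent code along the guidance direction by `strength`."""
    return [zi + strength * di for zi, di in zip(z, direction)]


# Toy latent codes standing in for embedded synthetic examples (hypothetical).
with_attr = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
without_attr = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = guidance_vector(with_attr, without_attr)
edited = apply_guidance([0.5, 0.5, 0.5], direction, 2.0)
```

In this toy setup the two example sets differ only along the first latent dimension, so the estimated direction points along that axis and the edit leaves the other dimensions unchanged. In practice the edited code would be passed to the generator to render audio, a step that is omitted here.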