One way to express an environmental sound is vocal imitation, in which a person mimics the rhythm and pitch of the sound with the voice. Vocal imitations can convey features of environmental sounds, such as rhythm and pitch, that the conventional inputs to environmental sound synthesis models (sound event labels, images, and text) cannot. Using vocal imitations as input for environmental sound synthesis should therefore make it possible to control the pitch and rhythm of the synthesized sounds and to generate diverse sounds. In this paper, we propose a framework for environmental sound conversion from vocal imitations to generate diverse sounds. We also propose a method of environmental sound synthesis from both vocal imitations and sound event labels. The sound event labels are expected to control the sound event class of the synthesized sound, which vocal imitations alone cannot control. Our objective and subjective experimental results show that vocal imitations effectively control the pitch and rhythm of the synthesized sounds and enable the generation of diverse sounds.