Most existing text-to-audio (TTA) generation methods produce mono outputs, neglecting the spatial information essential for immersive auditory experiences. To address this issue, we propose a cascaded method for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the input text into a structured format that specifies the timing and spatial details of each sound event. Next, a pretrained mono audio generation network synthesizes mono audio clips of varying durations, one for each event. These mono clips are then transformed into binaural audio by a binaural rendering neural network conditioned on the spatial details parsed by the LLM. Finally, the binaural clips are arranged according to their start times to form the multisource binaural output. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.
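To make the final arrangement step concrete, the following is a minimal sketch (not the paper's implementation) of mixing per-event binaural clips into one multisource track by their start times. It assumes each rendered clip is a NumPy array of shape (2, n_samples) paired with an onset in seconds; the class and function names are illustrative only.

```python
# Minimal sketch of arranging rendered binaural clips by start time.
# Assumption: each clip is a (2, n_samples) float32 array; names are hypothetical.
from dataclasses import dataclass
import numpy as np

SAMPLE_RATE = 44_100  # assumed sampling rate


@dataclass
class BinauralEvent:
    audio: np.ndarray   # shape (2, n_samples): left/right channels
    start_sec: float    # onset time taken from the LLM-parsed structure


def arrange_events(events: list[BinauralEvent], sr: int = SAMPLE_RATE) -> np.ndarray:
    """Sum all binaural clips into one stereo track at their start times."""
    total = max(int(e.start_sec * sr) + e.audio.shape[1] for e in events)
    mix = np.zeros((2, total), dtype=np.float32)
    for e in events:
        start = int(e.start_sec * sr)
        mix[:, start:start + e.audio.shape[1]] += e.audio
    # Rescale only if overlapping events push the mix past full scale.
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix


# Example: two placeholder clips (silence here) starting at 0.0 s and 1.5 s.
clip_a = BinauralEvent(np.zeros((2, SAMPLE_RATE), dtype=np.float32), 0.0)
clip_b = BinauralEvent(np.zeros((2, 2 * SAMPLE_RATE), dtype=np.float32), 1.5)
out = arrange_events([clip_a, clip_b])
print(out.shape)  # (2, 154350)
```

In practice the overlap-add above would follow the binaural rendering stage, with each event's duration and onset coming from the structured output of the LLM.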