Recent advancements in audio generation have enabled the creation of high-fidelity audio clips from free-form textual descriptions. However, temporal relationships, a critical feature of audio content, are currently underrepresented in mainstream models, resulting in imprecise temporal controllability. Specifically, users cannot accurately control the timestamps of sound events using free-form text. We argue that a significant factor is the absence of high-quality, temporally aligned audio-text datasets, which are essential for training models with temporal control. The more temporally aligned the annotations, the better models can learn the precise relationship between audio outputs and temporal textual prompts. Therefore, we present a strongly aligned audio-text dataset, AudioTime. It provides text annotations rich in temporal information, such as timestamps, duration, frequency, and ordering, covering almost all aspects of temporal control. Additionally, we offer a comprehensive test set and evaluation metrics to assess the temporal control performance of various models. Examples are available at https://zeyuxie29.github.io/AudioTime/