Audio generation tasks have recently attracted considerable research interest. Precise temporal controllability is essential for integrating audio generation into real-world applications. In this work, we propose PicoAudio, a temporally controlled audio generation framework. PicoAudio integrates temporal information to guide audio generation through tailored model design. It leverages data crawling, segmentation, filtering, and simulation of fine-grained, temporally aligned audio-text data. Both subjective and objective evaluations demonstrate that PicoAudio dramatically surpasses current state-of-the-art generation models in timestamp and occurrence-frequency controllability. Generated samples are available on the demo website https://PicoAudio.github.io.