The advent of large models marks a new era in machine learning: by leveraging vast datasets to capture and synthesize complex patterns, they significantly outperform smaller models. Despite these advances, scaling remains underexplored in audio generation. Previous efforts did not extend to the high-fidelity (HiFi) 44.1kHz domain, suffered from spectral discontinuities and blurriness at high frequencies, and lacked robustness to out-of-domain data. These limitations restrict the applicability of such models to diverse use cases, including music and singing generation. We introduce Enhanced Various Audio Generation via Scalable Generative Adversarial Networks (EVA-GAN), which yields significant improvements over the previous state of the art in spectral and high-frequency reconstruction and in robustness on out-of-domain data. EVA-GAN enables the generation of HiFi audio by employing an extensive dataset of 36,000 hours of 44.1kHz audio, a context-aware module, and a Human-In-The-Loop artifact measurement toolkit, and by scaling the model to approximately 200 million parameters. Demonstrations of our work are available at https://double-blind-eva-gan.cc.