Direct speech-to-image generation has recently shown promising results; however, compared to text-to-image generation, a large gap remains to be closed. Current approaches tackle this task in two stages: a speech encoding network followed by an image generative adversarial network (GAN). The speech encoding networks in these approaches produce embeddings that do not capture sufficient linguistic information to semantically represent the input speech, while GANs suffer from non-convergence, mode collapse, and diminished gradients, which cause unstable model parameters, limited sample diversity, and ineffective generator learning, respectively. To address these weaknesses, we introduce \textbf{Speak the Art (STA)}, a framework consisting of a speech encoding network and a VQ-Diffusion network conditioned on the resulting speech embeddings. To improve the speech embeddings, the speech encoding network is supervised by a large pre-trained image-text model during training. Replacing GANs with diffusion leads to more stable training and the generation of more diverse images. Additionally, we investigate the feasibility of extending our framework to multiple languages; as a proof of concept, we train it on two languages, English and Arabic. Finally, we show that our results surpass state-of-the-art models by a large margin.
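The abstract does not specify the supervision objective used to align the speech encoder with the pre-trained image-text model. A common choice for this kind of cross-modal distillation is a CLIP-style symmetric contrastive loss; the following is a minimal NumPy sketch of that generic objective, not STA's actual loss (the function name and temperature value are illustrative assumptions):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere so dot products become cosines."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: pull each speech embedding toward its paired
    text-teacher embedding and push it away from the other pairs in the batch.
    NOTE: a generic CLIP-style stand-in, not the loss reported for STA."""
    s = l2_normalize(speech_emb)                 # (N, D)
    t = l2_normalize(text_emb)                   # (N, D)
    logits = s @ t.T / temperature               # (N, N) pairwise similarities
    labels = np.arange(len(s))                   # matching pairs lie on the diagonal

    def xent(lg):
        # numerically stable softmax cross-entropy, averaged over the batch
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the speech-to-text and text-to-speech directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs the diagonal dominates and the loss approaches zero; mismatched pairs drive it up, which is what supervises the speech encoder toward the teacher's embedding space.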