The goal of a speech-to-image transform is to produce a photo-realistic picture directly from a speech signal. Recently, various studies have focused on this task and have achieved promising performance. However, current speech-to-image approaches are based on a stacked modular framework that suffers from three key issues: 1) training separate networks is time-consuming and inefficient, and the convergence of the final generative model depends strongly on the preceding generators; 2) the architecture ignores the quality of the precursor images; 3) multiple discriminator networks must be trained. To address these issues, we propose an efficient and effective single-stage framework called Fusion-S2iGan that yields perceptually plausible and semantically consistent image samples from given spoken descriptions. Fusion-S2iGan introduces a visual+speech fusion module (VSFM), constructed from a pixel-attention module (PAM), a speech-modulation module (SMM) and a weighted-fusion module (WFM), to inject the speech embedding from a speech encoder into the generator while improving the quality of the synthesized pictures. Fusion-S2iGan spreads the bimodal information over all layers of the generator network to reinforce the visual feature maps at the various hierarchical levels of the architecture. We conduct a series of experiments on four benchmark data sets, i.e., CUB birds, Oxford-102, Flickr8k and Places-subset. The experimental results demonstrate that Fusion-S2iGan outperforms state-of-the-art models with a multi-stage architecture and reaches a performance level close to that of traditional text-to-image approaches.
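To make the role of the three VSFM components more concrete, the following is a minimal NumPy sketch of how one fusion step on a single feature map might look. The abstract does not give the internal equations, so everything here is an illustrative assumption rather than the actual Fusion-S2iGan implementation: the function names, the parameter names (`w_h`, `w_s`, `w_gamma`, `w_beta`, `alpha`), the sigmoid-gated per-pixel attention, the affine (scale and shift) form of the speech modulation, and the scalar convex combination used for the weighted fusion.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def pixel_attention(h, s, w_h, w_s):
    """Hypothetical PAM: one speech-conditioned gate per spatial location.

    h: visual feature map, shape (C, H, W)
    s: speech embedding, shape (D,)
    w_h: channel projection, shape (C,); w_s: speech projection, shape (D,)
    """
    logits = np.tensordot(w_h, h, axes=1) + w_s @ s  # shape (H, W)
    return h * sigmoid(logits)[None, :, :]


def speech_modulation(h, s, w_gamma, w_beta):
    """Hypothetical SMM: channel-wise scale and shift predicted from speech.

    w_gamma, w_beta: linear maps from the speech embedding, shape (C, D)
    """
    gamma = w_gamma @ s  # shape (C,)
    beta = w_beta @ s    # shape (C,)
    return (1.0 + gamma)[:, None, None] * h + beta[:, None, None]


def weighted_fusion(a, b, alpha):
    """Hypothetical WFM: convex combination of the two branches."""
    return alpha * a + (1.0 - alpha) * b


def vsfm_sketch(h, s, w_h, w_s, w_gamma, w_beta, alpha):
    """Assumed composition: run PAM and SMM in parallel, then fuse them."""
    attended = pixel_attention(h, s, w_h, w_s)
    modulated = speech_modulation(h, s, w_gamma, w_beta)
    return weighted_fusion(attended, modulated, alpha)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, H, W, D = 8, 4, 4, 16  # toy sizes, chosen only for illustration
    h = rng.normal(size=(C, H, W))
    s = rng.normal(size=(D,))
    out = vsfm_sketch(
        h, s,
        w_h=rng.normal(size=C), w_s=rng.normal(size=D),
        w_gamma=0.1 * rng.normal(size=(C, D)),
        w_beta=0.1 * rng.normal(size=(C, D)),
        alpha=0.5,
    )
    print(out.shape)
```

In a single-stage generator of the kind described above, a block like `vsfm_sketch` would be applied after each upsampling layer, so that the same speech embedding conditions the feature maps at every resolution. In a trained model the projection weights and `alpha` would be learned rather than sampled at random.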