Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation

The goal of a speech-to-image transform is to produce a photo-realistic picture directly from a speech signal. Recently, various studies have focused on this task and have achieved promising performance. However, current speech-to-image approaches are based on a stacked modular framework that suffers from three vital issues: 1) training separate networks is time-consuming and inefficient, and the convergence of the final generative model strongly depends on the previous generators; 2) the quality of precursor images is ignored by this architecture; 3) multiple discriminator networks must be trained. To address these issues, we propose an efficient and effective single-stage framework called Fusion-S2iGan that yields perceptually plausible and semantically consistent image samples from given spoken descriptions. Fusion-S2iGan introduces a visual+speech fusion module (VSFM), constructed with a pixel-attention module (PAM), a speech-modulation module (SMM) and a weighted-fusion module (WFM), to inject the speech embedding from a speech encoder into the generator while improving the quality of synthesized pictures. Fusion-S2iGan spreads the bimodal information over all layers of the generator network to reinforce the visual feature maps at various hierarchical levels in the architecture. We conduct a series of experiments on four benchmark data sets, i.e., CUB birds, Oxford-102, Flickr8k and Places-subset. The experimental results demonstrate the superiority of Fusion-S2iGan over state-of-the-art models with a multi-stage architecture, with a performance level close to that of traditional text-to-image approaches.
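
The abstract names the three parts of the VSFM but does not give their equations, so the following is only a minimal NumPy sketch of how such a block could be wired, not the authors' implementation. Everything in it is an assumption made for illustration: the sigmoid per-pixel gate standing in for the PAM, the per-channel scale and shift predicted from the speech embedding standing in for the SMM, the softmax-weighted sum standing in for the WFM, and all function and parameter names (`pixel_attention`, `speech_modulation`, `weighted_fusion`, `w_att`, `w_gamma`, `w_beta`, `fusion_logits`).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def pixel_attention(feat, w_att):
    """Assumed PAM stand-in: one sigmoid gate per spatial location.

    feat:  (C, H, W) visual feature map
    w_att: (C,) hypothetical 1x1-convolution weights
    """
    gate = sigmoid(np.tensordot(w_att, feat, axes=([0], [0])))  # (H, W)
    return feat * gate[None, :, :]


def speech_modulation(feat, speech, w_gamma, w_beta):
    """Assumed SMM stand-in: channel-wise affine transform conditioned on speech.

    speech:          (D,) speech embedding
    w_gamma, w_beta: (C, D) hypothetical linear projections
    """
    gamma = w_gamma @ speech  # (C,) per-channel scale offset
    beta = w_beta @ speech    # (C,) per-channel shift
    return feat * (1.0 + gamma)[:, None, None] + beta[:, None, None]


def weighted_fusion(a, b, fusion_logits):
    """Assumed WFM stand-in: convex combination with softmax-normalised weights."""
    weights = np.exp(fusion_logits - fusion_logits.max())
    weights = weights / weights.sum()
    return weights[0] * a + weights[1] * b


def fusion_block(feat, speech, params):
    """Combine the attended and the speech-modulated feature maps."""
    attended = pixel_attention(feat, params["w_att"])
    modulated = speech_modulation(
        feat, speech, params["w_gamma"], params["w_beta"]
    )
    return weighted_fusion(attended, modulated, params["fusion_logits"])
```

In this sketch the output keeps the shape of the input feature map, so a block of this kind could in principle be applied after every generator layer, which is one way to read the abstract's statement that the bimodal information is spread over all layers of the generator.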
