Semantic communication is a promising technology that improves communication efficiency by transmitting only the semantic information of the source data. However, traditional semantic communication methods primarily focus on data reconstruction tasks, which may be inefficient for emerging generative tasks such as text-to-speech (TTS) synthesis. To address this limitation, this paper develops a novel generative semantic communication framework for TTS synthesis by leveraging generative artificial intelligence technologies. First, we utilize a pre-trained large speech model, WavLM, together with the residual vector quantization method to construct two semantic knowledge bases (KBs), one at the transmitter and one at the receiver. The KB at the transmitter enables effective semantic extraction, while the KB at the receiver facilitates lifelike speech synthesis. Then, we employ a transformer encoder and a diffusion model to achieve efficient semantic coding without introducing significant communication overhead. Finally, numerical results demonstrate that our framework achieves much higher fidelity for the generated speech than four baselines, over both the additive white Gaussian noise channel and the Rayleigh fading channel.
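As background on the KB construction step, the following is a minimal sketch of residual vector quantization (RVQ): each stage quantizes the residual left by the previous stage against its own codebook, so a vector is represented by a short sequence of codeword indices. The random codebooks here are placeholders (in the framework they would be learned from WavLM features), so this only illustrates the encode/decode mechanics, not the actual semantic KBs.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Greedy RVQ: at each stage, pick the codeword nearest to the
    current residual, then subtract it to form the next residual."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)
        i = int(np.argmin(dists))
        indices.append(i)
        residual = residual - cb[i]
    return indices

def rvq_decode(indices, codebooks):
    # The reconstruction is simply the sum of the selected codewords.
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# Toy setup: 8-dim vectors, 4 quantization stages, 16 codewords per stage.
# Only the 4 indices (4 * log2(16) = 16 bits) would need to be transmitted.
dim, num_stages, codebook_size = 8, 4, 16
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_stages)]
x = rng.normal(size=dim)
idx = rvq_encode(x, codebooks)
x_hat = rvq_decode(idx, codebooks)
```

With trained codebooks, later stages capture progressively finer detail of the residual, which is what lets the receiver-side KB resynthesize lifelike speech from a compact index stream.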