Multimodal signals, including text, audio, image, and video, can be integrated into Semantic Communication (SC) systems to provide an immersive experience with low latency and high quality at the semantic level. However, multimodal SC faces several challenges, including data heterogeneity, semantic ambiguity, and signal distortion during transmission. Recent advances in large AI models, particularly the Multimodal Language Model (MLM) and the Large Language Model (LLM), offer potential solutions to these issues. To this end, we propose a Large AI Model-based Multimodal SC (LAM-MSC) framework. We first present MLM-based Multimodal Alignment (MMA), which uses the MLM to transform between multimodal and unimodal data while preserving semantic consistency. We then propose a personalized LLM-based Knowledge Base (LKB), which enables users to perform personalized semantic extraction or recovery through the LLM, effectively resolving semantic ambiguity. Next, we apply Conditional Generative adversarial network-based channel Estimation (CGE) to estimate wireless channel state information, which mitigates the impact of fading channels on SC. Finally, simulations demonstrate the superior performance of the LAM-MSC framework.