In emergencies, the ability to quickly and accurately gather environmental data and command information, and to make timely decisions, is particularly critical. Traditional semantic communication frameworks, which rely primarily on a single modality, are sensitive to complex environments and adverse lighting conditions, limiting decision accuracy. To address this, this paper introduces a multimodal generative semantic communication framework named mm-GESCO. The framework ingests visible-light and infrared image streams, generates fused semantic segmentation maps, and transmits them using a combination of one-hot encoding and zlib compression to improve transmission efficiency. At the receiving end, the framework reconstructs the original multimodal images from the semantic maps. In addition, a latent diffusion model based on contrastive learning is designed to align data from different modalities within the latent space, allowing mm-GESCO to reconstruct the latent features of any modality presented at the input. Experimental results demonstrate that mm-GESCO achieves a compression ratio of up to 200 times, surpassing existing semantic communication frameworks and performing well on downstream tasks such as object classification and detection.
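The transmission step described above (one-hot encoding of a semantic segmentation map followed by zlib compression) can be sketched as follows. This is a minimal illustration, not the mm-GESCO implementation: the map size, class count, and label layout are assumptions, and real segmentation maps (large homogeneous regions) compress far better than random data.

```python
import zlib
import numpy as np

# Hypothetical segmentation map: each pixel holds an integer class label.
num_classes = 8
seg_map = np.zeros((64, 64), dtype=np.uint8)
seg_map[16:48, 16:48] = 3  # a single labeled region, for illustration

# One-hot encode: (H, W) label map -> (H, W, C) binary tensor.
one_hot = np.eye(num_classes, dtype=np.uint8)[seg_map]

# Pack the binary tensor into bits, then compress with zlib.
packed = np.packbits(one_hot)
compressed = zlib.compress(packed.tobytes(), level=9)

# Receiver side: decompress, unpack, and recover the label map via argmax.
bits = np.unpackbits(np.frombuffer(zlib.decompress(compressed), dtype=np.uint8))
recovered = bits[: one_hot.size].reshape(one_hot.shape).argmax(axis=-1)

assert np.array_equal(recovered, seg_map)  # lossless round trip
```

Because the one-hot tensor of a typical segmentation map is sparse and highly regular, zlib removes most of its redundancy; the paper's reported 200x ratio additionally reflects that a semantic map is far smaller than the raw multimodal images it stands in for.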