Pilot-guided Multimodal Semantic Communication for Audio-Visual Event Localization

Multimodal semantic communication, which integrates data modalities such as text, images, and audio, can significantly improve communication efficiency and reliability, and has broad application prospects in fields such as artificial intelligence, autonomous driving, and smart homes. However, current research relies primarily on analog channels and assumes constant channel states (perfect channel state information, CSI), which is inadequate for the dynamic physical channels and noise found in real-world scenarios. Existing methods often focus on single-modality tasks and cannot handle multimodal streaming data, such as video and audio, or the tasks built on them. Furthermore, current semantic encoding and decoding modules mainly transmit single-modality features, neglecting the needs of multimodal semantic enhancement and recognition tasks. To address these challenges, this paper proposes a pilot-guided multimodal semantic communication framework tailored to audio-visual event localization. The framework uses digital pilot codes and channel modules to guide the state of analog channels in real-world scenarios, and designs Euler-based multimodal semantic encoding and decoding that account for time-frequency characteristics under the dynamic channel state. This approach effectively handles multimodal streaming source data, particularly for audio-visual event localization. Extensive numerical experiments demonstrate the framework's robustness to channel variations and its support for various communication scenarios. The results show that the framework outperforms existing benchmark methods in terms of Signal-to-Noise Ratio (SNR), highlighting its advantage in semantic communication quality.
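To make the idea of pilot guidance concrete, the following is a minimal sketch of classical pilot-based channel estimation, not the framework proposed in this paper. It assumes a single-tap flat-fading channel with additive Gaussian noise and BPSK pilots; the function names (`simulate_pilots`, `ls_channel_estimate`, `estimate_snr_db`) and all parameter values are hypothetical and chosen only for illustration. A receiver that knows the pilot sequence can estimate the channel coefficient and the SNR, which a channel-aware semantic encoder or decoder could then condition on.

```python
import math
import random


def simulate_pilots(h, snr_db, n_pilots=256, seed=0):
    """Send known BPSK pilots through y = h * x + n (illustrative channel model)."""
    rng = random.Random(seed)
    pilot_tx = [complex(rng.choice((-1.0, 1.0)), 0.0) for _ in range(n_pilots)]
    # Noise variance chosen so that |h|^2 / noise_var matches the requested SNR.
    noise_var = abs(h) ** 2 / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_var / 2)  # per real/imaginary component
    pilot_rx = [
        h * x + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        for x in pilot_tx
    ]
    return pilot_tx, pilot_rx


def ls_channel_estimate(pilot_tx, pilot_rx):
    """Least-squares estimate of h: (x^H y) / (x^H x)."""
    num = sum(x.conjugate() * y for x, y in zip(pilot_tx, pilot_rx))
    den = sum(abs(x) ** 2 for x in pilot_tx)
    return num / den


def estimate_snr_db(pilot_tx, pilot_rx, h_hat):
    """Estimate SNR from the residual left after removing the estimated signal."""
    noise = sum(abs(y - h_hat * x) ** 2 for x, y in zip(pilot_tx, pilot_rx))
    signal = sum(abs(h_hat * x) ** 2 for x in pilot_tx)
    return 10 * math.log10(signal / noise)


if __name__ == "__main__":
    true_h = complex(0.8, -0.5)  # hypothetical channel coefficient
    tx, rx = simulate_pilots(true_h, snr_db=10.0)
    h_hat = ls_channel_estimate(tx, rx)
    print("estimated h:", h_hat)
    print("estimated SNR (dB):", estimate_snr_db(tx, rx, h_hat))
```

In this sketch the estimated pair (`h_hat`, SNR) stands in for the dynamic channel state; how the proposed framework actually feeds pilot information into its encoder and decoder is not specified in the abstract.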
