The integration of Foundation Models (FMs) and wireless communications is driving the evolution of image communication from bit-accurate transmission toward task-oriented transmission. However, existing task-oriented image communication methods still face three major challenges: insufficient task-oriented Token representation, inadequate collaboration between Visual Tokens and Task Tokens, and limited interpretability of task decisions. To address these challenges, we propose an Explainable Task-Oriented Token Communication (ET-TokenCom) framework. By treating Tokens as unified units for information representation and transmission, the proposed framework constructs an end-to-end communication link that spans visual perception, wireless transmission, and task reasoning. At the transmitter, the ET-TokenCom framework extracts Visual Tokens from images to preserve low-level visual information. Meanwhile, Task Tokens generated by the FM are introduced to represent the target information and decision intent required by the current task. A Cross-Modal Attention (CMA) fusion mechanism is further designed, enabling Task Tokens to explicitly guide the selection, weighting, and transmission of Visual Tokens. At the receiver, the framework integrates Token decoding with an explainable output mechanism, where attention heatmaps are generated to highlight critical perceptual regions under different task objectives and reveal the influence of Task Tokens on the outputs. Finally, simulation results validate the effectiveness and robustness of the proposed ET-TokenCom framework.
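As an illustration of how Task Tokens could guide the selection and weighting of Visual Tokens, the following is a minimal NumPy sketch of one plausible form of cross-attention scoring. It is not the ET-TokenCom implementation: the function name `task_guided_selection`, the `keep` budget, the use of raw tokens without learned query/key projections, and the averaging of attention over Task Tokens are all assumptions made for illustration. The per-token importance vector is reshaped into a grid to show how an attention heatmap of the kind described at the receiver might be obtained.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def task_guided_selection(visual_tokens, task_tokens, keep):
    """Hypothetical helper: score Visual Tokens against Task Tokens.

    visual_tokens: (N, d) array, one row per image patch.
    task_tokens:   (M, d) array, one row per Task Token.
    keep:          number of Visual Tokens to transmit (assumed budget).
    """
    d = visual_tokens.shape[1]
    # Task Tokens act as queries and Visual Tokens as keys.
    # Assumption: no learned projections, scaled dot-product scores only.
    attn = softmax(task_tokens @ visual_tokens.T / np.sqrt(d), axis=-1)  # (M, N)
    # Average over Task Tokens to get one importance value per Visual Token.
    importance = attn.mean(axis=0)  # (N,), sums to 1
    # Selection: indices of the `keep` highest-scoring tokens, in patch order.
    kept = np.sort(np.argsort(importance)[-keep:])
    # Weighting: scale each selected token by its importance.
    weighted = visual_tokens[kept] * importance[kept, None]
    return kept, weighted, importance

# Toy data: a 4x4 patch grid with 8-dimensional tokens and 2 Task Tokens.
rng = np.random.default_rng(0)
grid = 4
visual = rng.normal(size=(grid * grid, 8))
task = rng.normal(size=(2, 8))

kept, weighted, importance = task_guided_selection(visual, task, keep=5)
heatmap = importance.reshape(grid, grid)  # attention heatmap over patches
```

In this sketch a different set of Task Tokens yields a different `importance` vector, and therefore a different set of transmitted tokens and a different heatmap, which is the behavior the abstract attributes to task-guided transmission.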