Vision-Language Models (VLMs), with their strong capabilities in image and text understanding, offer a solid foundation for intelligent communications. However, their effectiveness is constrained by limited token granularity, overlong visual token sequences, and inadequate cross-modal alignment. To overcome these challenges, we propose TaiChi, a novel VLM framework designed for token communications. TaiChi adopts a dual-visual-tokenizer architecture that processes both high- and low-resolution images to collaboratively capture pixel-level details and global conceptual features. A Bilateral Attention Network (BAN) is introduced to fuse multi-scale visual tokens, thereby enhancing visual understanding and producing compact visual token sequences. In addition, a Kolmogorov-Arnold Network (KAN)-based modality projector with learnable activation functions is employed to achieve precise nonlinear alignment from visual features to the text semantic space, minimizing information loss. Finally, TaiChi is integrated into a multimodal, multitask token communication system equipped with a joint VLM-channel coding scheme. Experimental results validate the superior performance of TaiChi, as well as the feasibility and effectiveness of the TaiChi-driven token communication system.
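To make the KAN-based projector concrete, the sketch below shows a minimal Kolmogorov-Arnold-style layer mapping visual token features into a text embedding space. This is an illustrative assumption, not the paper's implementation: each input-output edge carries its own learnable univariate function, here parameterized as a mixture of Gaussian radial basis functions; the dimensions, basis choice, and class name `KANProjector` are all hypothetical.

```python
import numpy as np

class KANProjector:
    """Hedged sketch of a KAN-style modality projector (assumed design).

    Instead of a fixed nonlinearity after a linear map, every edge
    (input dim i -> output dim j) learns its own 1-D function
    phi_ij, modeled here as a sum of radial basis functions.
    """

    def __init__(self, d_vis, d_txt, n_basis=8, seed=0):
        rng = np.random.default_rng(seed)
        # RBF centers spread over the expected activation range [-1, 1]
        self.centers = np.linspace(-1.0, 1.0, n_basis)            # (B,)
        self.width = 2.0 / n_basis
        # One learnable coefficient vector per (input, output) edge
        self.coef = rng.normal(0.0, 0.1, (d_vis, d_txt, n_basis))  # (Dv,Dt,B)

    def forward(self, x):
        # x: (N, Dv) batch of visual token features
        # Evaluate the shared RBF basis at each input coordinate
        r = (x[:, :, None] - self.centers) / self.width            # (N,Dv,B)
        basis = np.exp(-r ** 2)                                    # (N,Dv,B)
        # Each output is a sum of per-edge univariate functions:
        #   y_j = sum_i phi_ij(x_i) = sum_i sum_b coef[i,j,b] * basis(x_i)
        return np.einsum("nib,itb->nt", basis, self.coef)          # (N,Dt)

# Project a batch of 4 visual tokens (dim 16) into a 32-dim text space
proj = KANProjector(d_vis=16, d_txt=32)
tokens = np.random.default_rng(1).normal(size=(4, 16))
text_space = proj.forward(tokens)
```

In an actual VLM pipeline the coefficients would be trained end-to-end so that the projected tokens align with the language model's embedding distribution; the learnable per-edge functions are what give the projector its nonlinear alignment capacity compared with a plain linear (MLP) projector.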