Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding

Token Communication (TokenCom) is a new paradigm, motivated by the recent success of Large AI Models (LAMs) and Multimodal Large Language Models (MLLMs), in which tokens serve as unified units of communication and computation, enabling efficient semantic- and goal-oriented information exchange in future wireless networks. In this paper, we propose a novel Video TokenCom framework for textual intent-guided multi-rate video communication with Unequal Error Protection (UEP)-based source-channel coding adaptation. The proposed framework integrates user-intended textual descriptions with discrete video tokenization and UEP to enhance semantic fidelity under restrictive bandwidth constraints. First, discrete video tokens are extracted by a pretrained video tokenizer, while text-conditioned vision-language modeling and optical-flow propagation are jointly used to identify the tokens that correspond to user-intended semantics across space and time. Next, we introduce a semantic-aware multi-rate bit-allocation strategy in which tokens highly related to the user intent are encoded at full codebook precision, whereas non-intended tokens are represented by differential encoding at reduced codebook precision, which saves rate while preserving semantic quality. Finally, a source and channel coding adaptation scheme is developed to adapt bit allocation and channel coding to varying resources and link conditions. Experiments on various video datasets demonstrate that the proposed framework outperforms both conventional and semantic communication baselines in perceptual and semantic quality over a wide SNR range.

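To make the rate saving of the multi-rate bit allocation concrete, the following minimal sketch counts the source bits for one group of tokens when intent-related tokens are indexed at full codebook precision and the remaining tokens at reduced precision. It is an illustration under stated assumptions, not the framework's implementation: the function names, the codebook sizes (1024 and 16), and the intent mask are hypothetical, and the sketch models neither the tokenizer, the differential encoding of non-intended tokens, nor the channel coding.

```python
from typing import Sequence


def token_bits(codebook_size: int) -> int:
    """Bits needed to index one entry of a codebook with `codebook_size` entries."""
    if codebook_size < 1:
        raise ValueError("codebook_size must be positive")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (codebook_size - 1).bit_length()


def multi_rate_bit_budget(
    intent_mask: Sequence[bool],
    full_codebook_size: int,
    reduced_codebook_size: int,
) -> int:
    """Total source bits when intent tokens use the full codebook and the
    remaining tokens use a smaller codebook (hypothetical sizes)."""
    n_intent = sum(1 for is_intent in intent_mask if is_intent)
    n_other = len(intent_mask) - n_intent
    return (n_intent * token_bits(full_codebook_size)
            + n_other * token_bits(reduced_codebook_size))


# Hypothetical example: 8 tokens, 3 of them marked as intent-related.
mask = [True, False, False, True, False, True, False, False]
multi_rate = multi_rate_bit_budget(mask, full_codebook_size=1024, reduced_codebook_size=16)
uniform = len(mask) * token_bits(1024)
print(multi_rate, uniform)  # 50 80
```

With these assumed numbers, the three intent tokens cost 10 bits each and the five other tokens cost 4 bits each, giving 50 bits instead of the 80 bits of uniform full-precision indexing. The actual saving depends on the fraction of intent-related tokens and on the codebook sizes chosen by the adaptation scheme.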