Foundation deep learning (DL) models are general-purpose models designed to learn robust, adaptable representations of their target modality, enabling fine-tuning across a range of downstream tasks. These models are pretrained on large, unlabeled datasets using self-supervised learning (SSL). Foundation models have demonstrated better generalization than traditional supervised approaches, a critical requirement for wireless communications, where dynamic environments demand model adaptability. In this work, we propose and demonstrate the effectiveness of a Vision Transformer (ViT) as a radio foundation model for spectrogram learning. We introduce a Masked Spectrogram Modeling (MSM) approach to pretrain the ViT in a self-supervised fashion. We evaluate the ViT-based foundation model on two downstream tasks: Channel State Information (CSI)-based human activity sensing and spectrogram segmentation. Experimental results demonstrate performance competitive with supervised training while generalizing across diverse domains. Notably, the pretrained ViT outperforms a four-times-larger model trained from scratch on the spectrogram segmentation task while requiring significantly less training time, and achieves competitive performance on the CSI-based human activity sensing task. This work demonstrates that MSM pretraining of ViTs is a promising technique for scalable foundation model development in future 6G networks.
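To make the MSM idea concrete, the following is a minimal NumPy sketch of the masking step: a spectrogram is split into non-overlapping patches, a large fraction of patches is hidden, and a reconstruction loss is computed on the masked patches only. All sizes here (64×64 spectrogram, 8×8 patches, 0.75 mask ratio) are illustrative assumptions, not values from this work, and the zero-filled "prediction" is a stand-in for the ViT decoder's output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spectrogram: 64 time bins x 64 frequency bins (illustrative sizes).
spec = rng.standard_normal((64, 64)).astype(np.float32)

def patchify(x, p=8):
    """Split a (H, W) spectrogram into non-overlapping p x p patches,
    returning an array of shape (num_patches, p * p)."""
    h, w = x.shape
    return x.reshape(h // p, p, w // p, p).swapaxes(1, 2).reshape(-1, p * p)

patches = patchify(spec)           # (64, 64) patch tokens for p = 8
num_patches = patches.shape[0]

# Randomly mask a large fraction of patches (ratio is an assumed value).
mask_ratio = 0.75
num_masked = int(num_patches * mask_ratio)
mask = np.zeros(num_patches, dtype=bool)
mask[rng.choice(num_patches, size=num_masked, replace=False)] = True

# A real MSM model would encode only the visible patches with the ViT and
# predict the masked ones; a zero-filled stand-in prediction shows how the
# loss is restricted to the masked patches.
pred = np.zeros_like(patches)
loss = np.mean((pred[mask] - patches[mask]) ** 2)
print(f"{num_masked} of {num_patches} patches masked, masked-patch MSE = {loss:.3f}")
```

Restricting the loss to masked patches forces the encoder to infer hidden time-frequency content from visible context, which is the source of the transferable representations described above.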