Multimodal Transformers for Wireless Communications: A Case Study in Beam Prediction

Wireless communications at high-frequency bands with large antenna arrays face challenges in beam management, which can potentially be eased by multimodal sensing information from cameras, LiDAR, radar, and GPS. In this paper, we present a multimodal transformer deep learning framework for sensing-assisted beam prediction. We employ a convolutional neural network to extract features from a sequence of images, point clouds, and raw radar data sampled over time. At each convolutional layer, we use transformer encoders to learn the hidden relations between feature tokens from different modalities and time instances in the abstraction space and to produce encoded vectors for the next level of feature extraction. We train the model on combinations of different modalities with supervised learning. To mitigate the effect of imbalanced data, we use focal loss and an exponential moving average. We also evaluate data processing and augmentation techniques such as image enhancement, segmentation, background filtering, multimodal data flipping, radar signal transformation, and GPS angle calibration. Experimental results show that our solution trained on image and GPS data achieves the best distance-based accuracy of predicted beams at 78.44%, and generalizes effectively to unseen scenarios, with accuracy near 73% for daytime and over 84% for nighttime. This outperforms models that use other modalities or arbitrary data processing techniques, which demonstrates the effectiveness of transformers with feature fusion in performing radio beam prediction from images and GPS. Furthermore, our solution could be pretrained on large sequences of multimodal wireless data and then fine-tuned for multiple downstream radio network tasks.
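As a rough illustration of the two imbalance-handling techniques named above, the following sketch shows the standard multi-class focal loss, -(1 - p_t)^gamma * log(p_t), and an exponential moving average of model weights. It is a minimal sketch and not the implementation used in this work: the function names, the focusing parameter `gamma=2.0`, the decay `0.999`, and the toy 4-class logits are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    p_t is the predicted probability of the true class (here, a beam index).
    gamma=2.0 is an assumed value; gamma=0 reduces to plain cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def ema_update(shadow, weights, decay=0.999):
    """One EMA step: decay * shadow + (1 - decay) * weights, element-wise.

    The decay value is an assumption chosen for illustration.
    """
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


# Toy example with 4 hypothetical beam classes.
easy = focal_loss([4.0, 0.0, 0.0, 0.0], target=0)  # confident, correct
hard = focal_loss([0.0, 4.0, 0.0, 0.0], target=0)  # confident, wrong
```

In this sketch a well-classified sample (`easy`) contributes far less to the loss than a misclassified one (`hard`), which is how focal loss shifts training effort toward rare or difficult classes. The averaged weights returned by `ema_update` would typically be used for evaluation in place of the raw weights.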
