AMBER: An Adaptive Multimodal Mask Transformer for Beam Prediction with Missing Modalities

With the widespread adoption of millimeter-wave (mmWave) massive multiple-input multiple-output (MIMO) in vehicular networks, accurate beam prediction and alignment have become critical for high-speed data transmission and reliable access. Traditional beam prediction approaches rely primarily on in-band beam training, whereas recent work has begun to exploit multimodal sensing to extract environmental semantics and improve prediction. However, the performance of existing multimodal fusion methods degrades significantly in real-world settings because they are vulnerable to missing data caused by sensor blockage, poor lighting, or GPS dropouts. To address this challenge, we propose AMBER (Adaptive multimodal Mask transformer for BEam pRediction), a novel end-to-end framework that processes temporal sequences of image, LiDAR, radar, and GPS data while adaptively handling arbitrary missing-modality cases. AMBER introduces learnable modality tokens and a missing-modality-aware mask to prevent cross-modal noise propagation, together with a learnable fusion token and multihead attention to achieve robust modality-specific information distillation and feature-level fusion. Furthermore, a class-former-aided modality alignment (CMA) module and a temporal-aware positional embedding are incorporated to preserve temporal coherence and ensure semantic alignment across modalities, facilitating the learning of modality-invariant and temporally consistent representations for beam prediction. Extensive experiments on the real-world DeepSense6G dataset demonstrate that AMBER significantly outperforms existing multimodal learning baselines. In particular, it maintains high beam prediction accuracy and robustness even under severe missing-modality scenarios, validating its effectiveness and practical applicability.
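To make the idea of a missing-modality-aware mask concrete, the following is a minimal sketch, not the AMBER implementation. It assumes a single attention head with no learned projections, one feature vector per modality, and a single fusion query standing in for the learnable fusion token. The names `masked_fusion`, `fusion_query`, `modality_feats`, and `present` are hypothetical. A missing modality receives an attention score of negative infinity, so its softmax weight is exactly zero and whatever its sensor slot contains cannot leak into the fused feature.

```python
import math

# Order of the four modalities named in the abstract (illustrative only).
MODALITIES = ["image", "lidar", "radar", "gps"]


def masked_fusion(fusion_query, modality_feats, present):
    """Attend from one fusion query over per-modality feature vectors.

    fusion_query   -- list of floats, stand-in for a learnable fusion token
    modality_feats -- one feature vector per modality, same length as the query
    present        -- one bool per modality; False marks a missing modality
    Returns (fused_vector, attention_weights).
    """
    if not any(present):
        raise ValueError("at least one modality must be present")
    d = len(fusion_query)

    # Scaled dot-product scores; missing modalities are masked with -inf.
    scores = []
    for feat, ok in zip(modality_feats, present):
        if ok:
            dot = sum(q * k for q, k in zip(fusion_query, feat))
            scores.append(dot / math.sqrt(d))
        else:
            scores.append(float("-inf"))

    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum over present modalities only, so masked slots are never read.
    fused = [
        sum(w * feat[i] for w, feat, ok in zip(weights, modality_feats, present) if ok)
        for i in range(d)
    ]
    return fused, weights
```

For example, calling `masked_fusion(q, feats, [True, True, False, True])` simulates a radar dropout: the radar weight is zero and the remaining weights still sum to one. A full model would apply the same kind of mask inside multihead attention layers over temporal token sequences rather than over a single vector per modality.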
