For the low-altitude economy (LAE), fast and accurate beam prediction between high-mobility unmanned aerial vehicles (UAVs) and ground base stations is of paramount importance for ensuring seamless coverage and reliable communications. However, existing deep learning-based beam prediction methods lack high-level semantic understanding of dynamic environments, resulting in poor generalization. On the other hand, emerging large language model (LLM) based approaches show promise in enhancing generalization, but they typically lack rich environmental perception and thus fail to capture the fine-grained spatial semantics essential for precise beam alignment. To tackle these limitations, we propose in this correspondence a novel end-to-end generative framework for beam prediction, called BeamVLM, which treats beam prediction as a visual question answering (VQA) task by capitalizing on powerful existing vision-language models (VLMs). By projecting raw visual patches directly into the language domain and judiciously designing an instructional prompt, the proposed BeamVLM enables the VLM to jointly reason over UAV trajectories and environmental context. Finally, experimental results on real-world datasets demonstrate that the proposed BeamVLM outperforms state-of-the-art methods in prediction accuracy and also exhibits superior generalization to other scenarios, such as vehicle-to-infrastructure (V2I) beam prediction.
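To make the two ingredients named above concrete, the following minimal sketch (a hypothetical illustration, not the authors' implementation; all dimensions, function names, and the prompt wording are assumptions) shows how flattened visual patches could be linearly projected into an LLM's token-embedding space and how an instructional prompt over the UAV trajectory might be composed:

```python
import numpy as np

def project_patches(patches, proj):
    """Linearly map flattened visual patches into the language embedding space.

    patches: (num_patches, patch_dim) array of raw, flattened image patches.
    proj:    (patch_dim, d_model) projection matrix (learned in practice).
    Returns (num_patches, d_model) pseudo-token embeddings fed to the VLM.
    """
    return patches @ proj

def build_prompt(trajectory, num_beams):
    """Compose an instructional prompt framing beam prediction as VQA.

    trajectory: list of (x, y) UAV positions; num_beams: codebook size.
    """
    traj = ", ".join(f"({x:.1f}, {y:.1f})" for x, y in trajectory)
    return (
        "You assist a ground base station with beam selection. "
        f"The UAV's recent positions are: {traj}. "
        "Given the attached camera view, answer with the best beam index "
        f"in [0, {num_beams - 1}]."
    )

# Hypothetical sizes: 16 patches of dimension 768, projected to d_model = 4096.
rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 768))
proj = rng.standard_normal((768, 4096)) * 0.01
tokens = project_patches(patches, proj)
prompt = build_prompt([(1.0, 2.0), (1.5, 2.4)], num_beams=64)
print(tokens.shape)  # (16, 4096)
```

In an end-to-end framework of this kind, the projected patch tokens would be prepended to the tokenized prompt, and the VLM would be fine-tuned to generate the beam index as text.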