Learning to Reflect: Hierarchical Multi-Agent Reinforcement Learning for CSI-Free mmWave Beam-Focusing

Reconfigurable Intelligent Surfaces promise to transform wireless environments, yet practical deployment is hindered by the prohibitive overhead of Channel State Information (CSI) estimation and the dimensionality explosion inherent in centralized optimization. This paper proposes a Hierarchical Multi-Agent Reinforcement Learning (HMARL) framework for controlling mechanically reconfigurable reflective surfaces in millimeter-wave (mmWave) systems. We introduce a "CSI-free" paradigm that replaces pilot-based channel estimation with readily available user localization data. To manage the large combinatorial action space, the architecture uses Multi-Agent Proximal Policy Optimization (MAPPO) under a Centralized Training with Decentralized Execution (CTDE) paradigm and decomposes the control problem into two abstraction levels: a high-level controller that allocates users to reflectors, and decentralized low-level controllers that optimize focal points. Ray-tracing evaluations show that the framework improves the received signal strength indicator (RSSI) by 2.81 to 7.94 dB over centralized baselines, with the advantage widening as system complexity increases. Scalability analysis shows minimal per-user performance degradation and stable total power utilization even when user density doubles. Robustness validation confirms the framework's viability across reflector aperture sizes of 45 to 99 tiles and shows graceful performance degradation under localization errors of up to 0.5 m. By eliminating CSI overhead while maintaining high-fidelity beam-focusing, this work establishes HMARL as a practical solution for intelligent mmWave environments.
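The two-level decomposition described above can be illustrated with a minimal structural sketch. It is not the paper's implementation: the learned MAPPO policies are replaced here by hand-written placeholder heuristics (nearest-reflector allocation and a centroid focal point), and every name in it (`Reflector`, `allocate_users`, `choose_focal_point`, `control_step`, `perturb_location`) is hypothetical. It only shows how user locations, rather than CSI, could flow through a high-level allocation step and per-reflector low-level steps, and how a bounded localization error might be injected for robustness testing.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, z) position in metres


@dataclass
class Reflector:
    center: Point  # hypothetical: position of the reflector's centre


def allocate_users(users: List[Point], reflectors: List[Reflector]) -> List[int]:
    """High-level step. Placeholder heuristic (NOT the learned policy):
    assign each user to the nearest reflector."""
    return [
        min(range(len(reflectors)),
            key=lambda r: math.dist(user, reflectors[r].center))
        for user in users
    ]


def choose_focal_point(assigned: List[Point]) -> Optional[Point]:
    """Low-level step for one reflector. Placeholder heuristic (NOT the
    learned policy): focus on the centroid of the assigned users."""
    if not assigned:
        return None
    n = len(assigned)
    return tuple(sum(u[i] for u in assigned) / n for i in range(3))


def control_step(users: List[Point], reflectors: List[Reflector]):
    """One hierarchical decision: allocate first, then let each reflector
    pick its focal point using only its own assigned users."""
    allocation = allocate_users(users, reflectors)
    focal_points = []
    for r in range(len(reflectors)):
        mine = [u for u, a in zip(users, allocation) if a == r]
        focal_points.append(choose_focal_point(mine))
    return allocation, focal_points


def perturb_location(user: Point, max_error_m: float,
                     rng: random.Random) -> Point:
    """Assumed error model: shift the reported location by at most
    max_error_m metres in a random direction."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    radius = rng.uniform(0.0, max_error_m)
    return tuple(c + radius * d / norm for c, d in zip(user, direction))


if __name__ == "__main__":
    rng = random.Random(0)
    users = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (9.0, 0.0, 1.0)]
    reflectors = [Reflector((0.0, 0.0, 2.0)), Reflector((10.0, 0.0, 2.0))]
    noisy = [perturb_location(u, 0.5, rng) for u in users]
    print(control_step(noisy, reflectors))
```

In this sketch each low-level step sees only the users assigned to its own reflector, which loosely mirrors decentralized execution; a trained system would substitute learned policies for both placeholder functions.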
