The combination of multiple-input multiple-output (MIMO) systems and intelligent reflecting surfaces (IRSs) is expected to be a critical enabler of beyond-5G (B5G) and 6G networks. In this work, two approaches are considered for jointly optimizing the IRS phase-shift matrix and the MIMO precoders of an IRS-assisted multi-stream (MS) multi-user MIMO (MU-MIMO) system. Both approaches aim to maximize the system sum-rate for every channel realization. The first is a novel contextual bandit (CB) framework with continuous state and action spaces, called deep contextual bandit-oriented deep deterministic policy gradient (DCB-DDPG). The second is a deep reinforcement learning (DRL) formulation in which the states, actions, and rewards are chosen so that the Markov decision process (MDP) property required by reinforcement learning (RL) is satisfied. Both proposals perform remarkably better than state-of-the-art heuristic methods in scenarios with high multi-user interference.
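To make the optimization target concrete, the following is a minimal sketch of a sum-rate objective for an IRS-assisted multi-stream MU-MIMO downlink. The abstract does not state the system model, so everything here is an assumption: the cascaded-plus-direct channel `H_d + H_r @ Phi @ G`, the log-det rate expression with inter-user interference treated as noise, and the hypothetical names `effective_channel`, `sum_rate`, `theta`, and `noise_var`.

```python
import numpy as np

def effective_channel(H_d, H_r, G, theta):
    """Assumed model: direct link H_d (L x M) plus the reflected path
    H_r (L x N) @ diag(exp(j*theta)) (N x N) @ G (N x M)."""
    Phi = np.diag(np.exp(1j * np.asarray(theta)))  # unit-modulus phase shifts
    return H_d + H_r @ Phi @ G

def sum_rate(H_eff, F, noise_var=1.0):
    """Sum-rate in bit/s/Hz over K users.

    H_eff: list of K effective channels, each L x M.
    F:     list of K precoders, each M x d (d streams per user).
    Other users' streams are treated as interference (an assumption).
    """
    total = 0.0
    for k, H_k in enumerate(H_eff):
        L = H_k.shape[0]
        # Covariance of the desired signal at user k
        signal = H_k @ F[k] @ F[k].conj().T @ H_k.conj().T
        # Noise plus multi-user interference covariance
        interference = noise_var * np.eye(L, dtype=complex)
        for j, F_j in enumerate(F):
            if j != k:
                interference += H_k @ F_j @ F_j.conj().T @ H_k.conj().T
        # log2 det(I + signal @ interference^{-1}); slogdet is numerically safer
        _, logdet = np.linalg.slogdet(np.eye(L) + signal @ np.linalg.inv(interference))
        total += logdet / np.log(2)
    return total
```

Under this sketch, one channel realization would play the role of the context (or state), the pair of phase shifts and precoders the action, and a quantity such as `sum_rate` the reward; the exact definitions used by DCB-DDPG and the DRL formulation are not given in the abstract.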