A Bayesian Framework of Deep Reinforcement Learning for Joint O-RAN/MEC Orchestration

Multi-access Edge Computing (MEC) can be implemented together with Open Radio Access Network (O-RAN) over commodity platforms to offer low-cost deployment and bring services closer to end users. In this paper, we propose a joint O-RAN/MEC orchestration framework based on Bayesian deep reinforcement learning (RL) that jointly controls the O-RAN functional splits, the allocated resources and hosting locations of the O-RAN/MEC services across geo-distributed platforms, and the routing of each O-RAN/MEC data flow. The goal is to minimize the long-term overall network operation cost and maximize the MEC performance criterion while adapting to possibly time-varying O-RAN/MEC demands and resource availability. This orchestration problem is formulated as a Markov decision process (MDP). However, the system consists of multiple base stations (BSs) that share the same resources and serve heterogeneous demands, and their parameters are related in non-trivial ways. Consequently, finding an exact model of the underlying system is impractical, and the formulated MDP has a large state space and a multi-dimensional discrete action space. To address these modeling and dimensionality issues, a novel model-free RL agent is proposed for our solution framework. The agent is built on a Double Deep Q-Network (DDQN), which handles the large state space, and is combined with action branching, an action decomposition method that handles the multi-dimensional discrete action space with only a linear increase in complexity. Further, an efficient exploration-exploitation strategy under a Bayesian framework using Thompson sampling is proposed to improve the learning performance and expedite convergence. Trace-driven simulations are performed using an O-RAN-compliant model. The results show that our approach is data-efficient (i.e., converges faster) and increases the returned reward by 32% compared with its non-Bayesian version.

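The three ingredients named in the abstract (action branching, the Double DQN target, and Thompson sampling) can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the branch sizes in `BRANCH_SIZES`, the function names, and in particular the independent Gaussian posterior over each sub-action's Q-value. The abstract does not say how the posterior is approximated, so this is only one plausible instance of the idea.

```python
import numpy as np

# Hypothetical number of choices per action dimension (branch), e.g.
# functional split, hosting location, resource level, route.
BRANCH_SIZES = [4, 3, 5, 6]


def joint_action_count(branch_sizes):
    """Outputs a flat Q-network would need: one per joint action."""
    return int(np.prod(branch_sizes))


def branching_output_count(branch_sizes):
    """Outputs needed with action branching: one per sub-action."""
    return int(np.sum(branch_sizes))


def thompson_select(q_means, q_stds, rng):
    """Thompson sampling per branch.

    Assumes (for illustration only) an independent Gaussian posterior
    over each sub-action's Q-value. One sample is drawn per sub-action
    and each branch acts greedily on its own draw.
    """
    action = []
    for mean, std in zip(q_means, q_stds):
        sampled_q = rng.normal(mean, std)
        action.append(int(np.argmax(sampled_q)))
    return action


def double_q_targets(reward, gamma, q_online_next, q_target_next):
    """Per-branch Double DQN target.

    The online network selects the next sub-action and the target
    network evaluates it.
    """
    targets = []
    for q_on, q_tg in zip(q_online_next, q_target_next):
        best = int(np.argmax(q_on))
        targets.append(reward + gamma * q_tg[best])
    return targets


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_means = [rng.normal(size=n) for n in BRANCH_SIZES]
    q_stds = [np.full(n, 0.5) for n in BRANCH_SIZES]
    print("joint actions:", joint_action_count(BRANCH_SIZES))
    print("branching outputs:", branching_output_count(BRANCH_SIZES))
    print("sampled action:", thompson_select(q_means, q_stds, rng))
```

With the assumed branch sizes, a flat Q-network would need 4 x 3 x 5 x 6 = 360 outputs, while the branching variant needs only 4 + 3 + 5 + 6 = 18, which is the linear growth the abstract refers to. Sampling from the posterior rather than taking the posterior mean is what drives exploration: sub-actions with high uncertainty are occasionally selected even when their mean estimate is lower.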