The performance of a Reinforcement Learning (RL) system depends on a set of initial conditions (hyperparameters). Choosing good hyperparameters is challenging: tuning usually requires a manual or automated search for suitable values, and evaluating the algorithm is costly for complex models, which makes the tuning process computationally expensive and time-consuming. In this paper, we propose a framework that integrates complex event processing and temporal models to alleviate these limitations. This combination makes it possible to gain insights about a running RL system efficiently and unobtrusively through data stream monitoring, and to create abstract representations that support reasoning about the system's historical behaviour. The knowledge obtained is fed back to the RL system to optimise its hyperparameters while making effective use of parallel resources. We introduce a novel history-aware epsilon-greedy logic for hyperparameter optimisation. Instead of keeping hyperparameters fixed for the whole of training, it adjusts them at runtime based on an analysis of the agent's performance over time windows within a single agent's lifetime. We tested the proposed approach in a 5G mobile communications case study that uses a Deep Q-Network (DQN), an RL algorithm, for its decision-making. Our experiments demonstrated the effects of history-based hyperparameter tuning on training stability and reward values. The encouraging results show that the proposed history-aware framework significantly improved performance compared to traditional hyperparameter tuning approaches.
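The abstract does not specify how the history-aware epsilon-greedy logic is implemented, so the following is only a minimal sketch of one way such a selector could work, under stated assumptions. It assumes a fixed list of candidate hyperparameter configurations, a history that stores the mean reward of each completed time window per configuration, and a choice made once per window. The class name `HistoryAwareEpsilonGreedy`, its methods, and the example configuration keys are hypothetical and are not taken from the paper.

```python
import random


class HistoryAwareEpsilonGreedy:
    """Hypothetical selector: picks a hyperparameter configuration per time window."""

    def __init__(self, candidates, epsilon=0.1, seed=None):
        if not candidates:
            raise ValueError("need at least one candidate configuration")
        self.candidates = list(candidates)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # history[i] holds the mean reward of every window run with candidate i
        self.history = {i: [] for i in range(len(self.candidates))}

    def record_window(self, index, rewards):
        """Store the mean reward observed in one time window for a candidate."""
        if not rewards:
            raise ValueError("a window must contain at least one reward")
        self.history[index].append(sum(rewards) / len(rewards))

    def mean_score(self, index):
        """Average window score of a candidate; -inf if it was never used."""
        scores = self.history[index]
        return sum(scores) / len(scores) if scores else float("-inf")

    def select(self):
        """Explore a random candidate with probability epsilon, else exploit history."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.candidates))
        return max(range(len(self.candidates)), key=self.mean_score)


# Assumed example configurations (illustrative values only).
configs = [
    {"learning_rate": 1e-3, "gamma": 0.99},
    {"learning_rate": 1e-4, "gamma": 0.95},
]
selector = HistoryAwareEpsilonGreedy(configs, epsilon=0.2, seed=0)
selector.record_window(0, [1.0, 2.0, 3.0])
selector.record_window(1, [5.0, 5.0])
next_config = configs[selector.select()]
```

In this sketch the training loop would call `record_window` at the end of each time window and `select` before the next one, so the agent's hyperparameters change at runtime rather than staying fixed. Candidates without history score negative infinity, so they are only reached through the exploration branch. Whether the paper handles untried configurations this way is not stated.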