Recent advances in learning techniques have drawn attention for their applicability to a wide range of real-world sequential decision-making problems. Yet many practical applications must satisfy critical constraints when operating in real environments. Most learning solutions neglect the risk of failing to meet these constraints, which hinders their deployment in real-world settings. In this paper, we propose a risk-aware decision-making framework for contextual bandit problems that accommodates constraints and continuous action spaces. Our approach employs an actor multi-critic architecture, with each critic characterizing the distribution of performance and constraint metrics. The framework is designed to support a range of risk levels, balancing constraint satisfaction against performance. To demonstrate its effectiveness, we first compare it against state-of-the-art baseline methods in a synthetic environment, highlighting the impact of intrinsic environmental noise across different risk configurations. We then evaluate the framework in a real-world use case involving a 5G mobile network, where only our approach consistently satisfies the system constraint (a signal processing reliability target), at a small performance cost (an 8.5% increase in power consumption).
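To make the trade-off between risk level and performance concrete, the following is a minimal sketch and not the framework proposed in the paper. It assumes that distributional critics are already available as arrays of samples, replaces the actor with a search over a few candidate actions, and uses a lower quantile of the constraint metric as the risk measure. The function name `risk_aware_action`, its parameters, and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def risk_aware_action(actions, cost_samples, constraint_samples,
                      threshold, risk_level):
    """Pick the cheapest candidate action that is feasible under a risk level.

    Hypothetical helper, not the paper's algorithm.
    actions:            shape (A,), candidate actions
    cost_samples:       shape (A, S), sampled costs per action (e.g. power)
    constraint_samples: shape (A, S), sampled constraint metric per action
                        (e.g. reliability; higher is better)
    threshold:          required minimum value of the constraint metric
    risk_level:         quantile in [0, 1]; smaller values are more risk-averse
    """
    # Pessimistic view of the constraint metric: a low quantile per action.
    pessimistic = np.quantile(constraint_samples, risk_level, axis=1)
    feasible = pessimistic >= threshold

    if not feasible.any():
        # No candidate meets the target: fall back to the safest one.
        return actions[np.argmax(pessimistic)]

    # Among feasible candidates, minimize the expected cost.
    mean_cost = cost_samples.mean(axis=1)
    return actions[np.argmin(np.where(feasible, mean_cost, np.inf))]


# Toy data (made up): three candidate actions with increasing cost.
actions = np.array([0.0, 0.5, 1.0])
cost = np.array([[1.0, 1.0, 1.0],
                 [2.0, 2.0, 2.0],
                 [3.0, 3.0, 3.0]])
reliability = np.array([[0.50, 0.60, 0.70],
                        [0.85, 0.95, 0.99],
                        [0.97, 0.98, 0.99]])

# Risk-neutral-ish setting (median) versus worst-case setting (minimum).
moderate = risk_aware_action(actions, cost, reliability, 0.9, risk_level=0.5)
cautious = risk_aware_action(actions, cost, reliability, 0.9, risk_level=0.0)
```

In this toy setting, the more risk-averse configuration selects a more expensive action because the cheaper one occasionally misses the reliability target, which mirrors the kind of performance cost the abstract reports for consistent constraint satisfaction.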