Reinforcement Learning (RL) policies predict actions from current observations so as to maximize cumulative future rewards. In real-world (i.e., non-simulated) applications, sensors measure the current state and provide the observations on which RL policies base their decisions. A significant challenge in deploying RL policies in such settings is handling sensor dropouts, which can result from hardware malfunctions, physical damage, or environmental factors such as dust on a camera lens. A common mitigation strategy is to install backup sensors, though this adds cost. This paper studies how to optimize backup sensor configurations to maximize expected returns while keeping costs below a specified threshold, C. Our approach uses a second-order approximation of expected returns and adds penalties for exceeding the cost constraint. We then optimize the resulting quadratic program using Tabu Search, a meta-heuristic algorithm. The approach is evaluated across eight OpenAI Gym environments and a custom Unity-based robotic environment (RobotArmGrasping). Empirical results demonstrate that our quadratic program effectively approximates the true expected returns, facilitating the identification of optimal sensor configurations.
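To make the optimization step concrete, the following is a minimal sketch of a penalized quadratic objective maximized with a basic Tabu Search. It is not the paper's implementation: the abstract does not give the exact formulation, so the binary selection vector `x` (1 means a backup is installed for that sensor), the matrix `Q`, the squared overspend penalty, the single-bit-flip neighbourhood, and all numeric values are assumptions chosen purely for illustration.

```python
def objective(x, Q, costs, budget, penalty):
    """Quadratic surrogate of the expected return, minus a penalty for overspending.

    Assumption: diagonal entries of Q hold per-sensor gains and off-diagonal
    entries hold pairwise interaction terms.
    """
    n = len(x)
    value = sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    overspend = max(0.0, sum(c * xi for c, xi in zip(costs, x)) - budget)
    return value - penalty * overspend ** 2


def tabu_search(Q, costs, budget, penalty=10.0, iters=100, tenure=2):
    """Maximize the penalized objective over binary vectors with single-bit flips."""
    n = len(costs)
    current = [0] * n                      # start with no backup sensors
    best = current[:]
    best_val = objective(current, Q, costs, budget, penalty)
    tabu_until = [0] * n                   # bit i may not be flipped until step > tabu_until[i]

    for step in range(1, iters + 1):
        candidates = []
        for i in range(n):
            neighbor = current[:]
            neighbor[i] ^= 1               # flip one backup decision
            val = objective(neighbor, Q, costs, budget, penalty)
            # A tabu move is allowed only if it beats the best value so far (aspiration).
            if tabu_until[i] < step or val > best_val:
                candidates.append((val, i, neighbor))
        if not candidates:
            continue
        # Take the best admissible neighbor, even if it is worse than the current point.
        val, i, current = max(candidates, key=lambda c: c[0])
        tabu_until[i] = step + tenure
        if val > best_val:
            best, best_val = current[:], val
    return best, best_val


# Hypothetical 4-sensor instance (values invented for illustration only).
Q = [
    [3.0, -1.0, 0.5, 0.0],
    [0.0,  2.0, 0.0, 0.0],
    [0.0,  0.0, 4.0, 0.0],
    [0.0,  0.0, 0.0, 1.0],
]
costs = [2.0, 1.0, 3.0, 1.0]
budget = 4.0

best, best_val = tabu_search(Q, costs, budget)
print(best, best_val)
```

On this toy instance the search settles on backing up the second and third sensors, which uses the full budget of 4. Accepting the best admissible neighbour even when it is worse than the current configuration, while temporarily forbidding recently flipped bits, is what lets Tabu Search move away from local optima instead of cycling back to them.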