Constrained partially observable Markov decision processes (CPOMDPs) have been used to model various real-world phenomena. However, they are notoriously difficult to solve to optimality, and only a few approximation methods exist for obtaining high-quality solutions. In this study, grid-based approximations are combined with linear programming (LP) models to generate approximate policies for CPOMDPs. A detailed numerical study is conducted with six CPOMDP problem instances, considering both their finite- and infinite-horizon formulations. First, the quality of the approximation algorithms on unconstrained POMDP problems is established through a comparative analysis with exact solution methods. Then, the performance of the LP-based CPOMDP solution approaches is evaluated for varying budget levels. Finally, the flexibility of LP-based approaches is demonstrated by applying deterministic policy constraints, and their impact on rewards and CPU run time is investigated in detail. For most of the finite-horizon problems, deterministic policy constraints have little impact on expected reward, but they significantly increase CPU run time. For infinite-horizon problems, the reverse is observed: deterministic policies tend to yield lower expected total rewards than their stochastic counterparts, but the impact of deterministic constraints on CPU run time is negligible. Overall, these results demonstrate that LP models can effectively generate approximate policies for both finite- and infinite-horizon problems while providing the flexibility to incorporate various additional constraints into the underlying model.
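To make the combination of a belief grid with an LP concrete, the following is a sketch of a standard occupancy-measure formulation for the infinite-horizon discounted case. It is not necessarily the exact model used in this study. All symbols are assumptions introduced for illustration: $G$ is a finite set of grid beliefs, $A$ the action set, $x(b,a)$ the discounted occupancy of taking action $a$ at grid belief $b$, $r$ and $c$ the expected reward and cost, $B$ the budget, $\gamma \in [0,1)$ the discount factor, $\delta$ the initial distribution over grid points (summing to 1), and $P(b' \mid b,a)$ the grid-to-grid transition probability obtained by interpolating each updated belief onto the grid.

```latex
% Approximate CPOMDP as an LP over grid beliefs (illustrative notation, assumed)
\begin{align}
\max_{x \ge 0} \quad
  & \sum_{b \in G} \sum_{a \in A} r(b,a)\, x(b,a) \\
\text{s.t.} \quad
  & \sum_{a \in A} x(b',a)
    - \gamma \sum_{b \in G} \sum_{a \in A} P(b' \mid b,a)\, x(b,a)
    = \delta(b') \qquad \forall b' \in G, \\
  & \sum_{b \in G} \sum_{a \in A} c(b,a)\, x(b,a) \le B.
\end{align}

% A stochastic policy is recovered wherever the denominator is positive:
\begin{equation}
\pi(a \mid b) = \frac{x(b,a)}{\sum_{a' \in A} x(b,a')}.
\end{equation}

% One common encoding of deterministic policy constraints (assumed here):
% binary y(b,a) selects a single action per grid belief.
\begin{align}
x(b,a) \le \frac{1}{1-\gamma}\, y(b,a), \qquad
\sum_{a \in A} y(b,a) = 1, \qquad
y(b,a) \in \{0,1\} \qquad \forall b \in G,\ a \in A.
\end{align}
```

The bound $1/(1-\gamma)$ is valid because summing the flow constraints over $b'$ gives $\sum_{b,a} x(b,a) = 1/(1-\gamma)$. Adding the binary variables $y$ turns the LP into a mixed-integer program. A finite-horizon analogue would index $x$ by decision epoch and replace the discounted flow constraints with epoch-to-epoch balance equations.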