For risk-averse finite-horizon Markov decision problems, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems, which generalizes the class of linear systems. We use both concepts in a feature-based $Q$-learning method with multipattern $Q$-factor approximation, and we prove a high-probability regret bound of $\mathcal{O}\big(H^2 N^H \sqrt{K}\big)$, where $H$ is the horizon, $N$ is the mini-batch size, and $K$ is the number of episodes. We also propose an economical version of the $Q$-learning method that streamlines the policy evaluation (backward) step. The theoretical results are illustrated on a stochastic assignment problem and a short-horizon multi-armed bandit problem.
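To give a feel for how the stated regret bound scales, the following minimal sketch evaluates only its leading term $H^2 N^H \sqrt{K}$. The constants and any logarithmic factors hidden by the $\mathcal{O}(\cdot)$ notation are not given in the abstract and are omitted here, so the numbers are meaningful only as relative growth rates. The function names `regret_bound_scale` and `average_regret_scale` are hypothetical and introduced purely for illustration.

```python
import math


def regret_bound_scale(H: int, N: int, K: int) -> float:
    """Leading term H^2 * N^H * sqrt(K) of the stated bound.

    Assumption: constants and logarithmic factors hidden by the
    big-O notation are dropped, so only ratios are informative.
    """
    return H**2 * N**H * math.sqrt(K)


def average_regret_scale(H: int, N: int, K: int) -> float:
    """Leading term divided by the number of episodes K."""
    return regret_bound_scale(H, N, K) / K


if __name__ == "__main__":
    # Growth in K is sublinear: quadrupling K only doubles the term.
    print(regret_bound_scale(3, 2, 100), regret_bound_scale(3, 2, 400))
    # Growth in H is exponential through N^H.
    for H in (2, 4, 8):
        print(H, regret_bound_scale(H, 2, 100))
    # The per-episode term shrinks like 1 / sqrt(K).
    for K in (100, 10_000):
        print(K, average_regret_scale(3, 2, K))
```

The leading term grows like $\sqrt{K}$ in the number of episodes, so the per-episode term vanishes as $K$ grows, while the factor $N^H$ grows exponentially in the horizon, which is consistent with the focus on short-horizon examples.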