We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, on which any algorithm requires at least $\Omega(SAB_{\star}^3/(c_{\min}\epsilon^2))$ samples to return an $\epsilon$-optimal policy with high probability. Surprisingly, this implies that whenever $c_{\min} = 0$ an SSP problem may not be learnable, revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this lower bound with two algorithms: one that matches it, up to logarithmic factors, in the general case, and one that matches it up to logarithmic factors even when $c_{\min} = 0$, provided that the optimal policy has a bounded hitting time to the goal state.
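To make the dependence on $c_{\min}$ concrete, the following is a minimal sketch that evaluates only the leading term $SAB_{\star}^3/(c_{\min}\epsilon^2)$ of the lower bound. The function name `ssp_lower_bound_scale` and the parameter values are hypothetical and chosen purely for illustration. Constants and logarithmic factors are dropped, so the output indicates scaling, not an actual sample count.

```python
import math


def ssp_lower_bound_scale(S, A, B_star, c_min, eps):
    """Leading term S*A*B_star^3 / (c_min * eps^2) of the lower bound.

    Illustrative only: constants and logarithmic factors are omitted.
    Returns math.inf when c_min == 0, where the bound diverges.
    """
    if c_min < 0 or eps <= 0:
        raise ValueError("c_min must be non-negative and eps positive")
    if c_min == 0:
        return math.inf
    return S * A * B_star**3 / (c_min * eps**2)


if __name__ == "__main__":
    # Hypothetical instance sizes; shrinking c_min inflates the bound.
    for c_min in (1.0, 0.1, 0.01, 0.0):
        scale = ssp_lower_bound_scale(S=10, A=4, B_star=5.0, c_min=c_min, eps=0.1)
        print(f"c_min={c_min:<5} scale={scale:.3g}")
```

Each tenfold reduction in `c_min` multiplies the term by ten, and at `c_min = 0` it is unbounded, which is the regime where the abstract notes that an SSP problem may not be learnable without a further assumption.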