We propose training fitted Q-iteration with log-loss (FQI-LOG) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-LOG scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving $\textit{small-cost}$ bounds, i.e. bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-LOG uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.