We propose training fitted Q-iteration with log-loss (FQI-LOG) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-LOG scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving $\textit{small-cost}$ bounds in batch RL, i.e., bounds that scale with the optimal achievable cost. Moreover, we empirically verify that FQI-LOG requires fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
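To make the contrast between the two training objectives concrete, the following is a minimal sketch of the regression step inside one fitted Q-iteration update. It rests on assumptions not stated in the abstract: cost-to-go values are taken to be normalized to [0, 1], "log-loss" is read as binary cross-entropy with the Bellman targets used as soft labels, and the helper names (`bellman_targets`, `squared_loss`, `log_loss`), the `gamma` parameter, and the toy numbers are all hypothetical.

```python
import numpy as np


def bellman_targets(costs, next_q, gamma=1.0):
    """Regression targets c + gamma * min_a' Q(s', a'), clipped to [0, 1].

    Assumption: costs are minimized and cost-to-go values lie in [0, 1].
    """
    return np.clip(costs + gamma * next_q.min(axis=1), 0.0, 1.0)


def squared_loss(pred, target):
    """Mean squared error between predicted and target Q-values."""
    return float(np.mean((pred - target) ** 2))


def log_loss(pred, target, eps=1e-12):
    """Binary cross-entropy with soft targets in [0, 1] (assumed form of log-loss)."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))


# Toy batch of two transitions (hypothetical numbers).
costs = np.array([0.0, 0.5])               # immediate costs
next_q = np.array([[0.0, 0.3],             # current Q estimates at next states,
                   [0.2, 0.4]])            # one column per action
targets = bellman_targets(costs, next_q)

# Compare both losses at the targets themselves and at a perturbed prediction.
perturbed = np.array([0.1, 0.8])
sq_at_target = squared_loss(targets, targets)
sq_perturbed = squared_loss(perturbed, targets)
ll_at_target = log_loss(targets, targets)
ll_perturbed = log_loss(perturbed, targets)
```

Both losses are minimized when the prediction equals the target, so the two variants fit the same regression function in the limit; they differ in how heavily errors are penalized when targets sit near zero, which is the regime the small-cost bounds concern.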