An agent's ability to leverage past experience is critical for solving new tasks efficiently. Prior work has focused on using value function estimates to obtain zero-shot approximations of the solution to a new task. In the setting of soft Q-learning, we show that any value function estimate can also be used to derive double-sided bounds on the optimal value function. The derived bounds lead to new approaches for boosting training performance, which we validate experimentally. Notably, the proposed framework suggests an alternative method for updating the Q-function that improves performance.
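To make the idea of double-sided bounds concrete, the following is a minimal tabular sketch. It does not reproduce the bounds derived in this work; it uses a standard residual-based argument that relies only on the soft Bellman operator $\mathcal{T}$ being monotone and satisfying $\mathcal{T}(Q + c) = \mathcal{T}Q + \gamma c$. With $\Delta = \mathcal{T}Q - Q$ for an arbitrary estimate $Q$, this gives $\mathcal{T}Q + \gamma \min\Delta / (1-\gamma) \le Q^* \le \mathcal{T}Q + \gamma \max\Delta / (1-\gamma)$. The uniform prior policy, the inverse temperature `beta`, the random MDP, and all function names are assumptions made purely for illustration.

```python
import numpy as np


def soft_value(Q, beta):
    """Soft state value V(s) = (1/beta) * log mean_a exp(beta * Q(s, a)).

    Assumes a uniform prior policy over actions (an illustrative choice).
    """
    m = Q.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    return m[:, 0] + np.log(np.exp(beta * (Q - m)).mean(axis=1)) / beta


def soft_backup(Q, R, P, gamma, beta):
    """One application of the soft Bellman operator: R + gamma * E_{s'}[V(s')].

    R has shape (S, A); P has shape (S, A, S) with P[s, a] a distribution over s'.
    """
    return R + gamma * P @ soft_value(Q, beta)


def double_sided_bounds(Q, R, P, gamma, beta):
    """Lower and upper bounds on Q* computed from an arbitrary estimate Q."""
    TQ = soft_backup(Q, R, P, gamma, beta)
    delta = TQ - Q  # Bellman residual of the estimate
    lower = TQ + gamma * delta.min() / (1.0 - gamma)
    upper = TQ + gamma * delta.max() / (1.0 - gamma)
    return lower, upper


def random_mdp(n_states, n_actions, seed=0):
    """Hypothetical random MDP used only to exercise the bounds."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)  # normalize transition probabilities
    R = rng.random((n_states, n_actions))
    return R, P


if __name__ == "__main__":
    gamma, beta = 0.9, 2.0
    R, P = random_mdp(6, 3)
    Q_est = np.random.default_rng(1).normal(size=R.shape)  # arbitrary estimate

    # Reference solution: iterate the contraction until it has converged.
    Q_star = np.zeros_like(R)
    for _ in range(2000):
        Q_star = soft_backup(Q_star, R, P, gamma, beta)

    lower, upper = double_sided_bounds(Q_est, R, P, gamma, beta)
    print("bounds hold:", bool(np.all(lower <= Q_star + 1e-8) and np.all(Q_star <= upper + 1e-8)))
    print("mean gap:", float((upper - lower).mean()))
```

The gap between the two bounds is $\gamma(\max\Delta - \min\Delta)/(1-\gamma)$, so it shrinks as the Bellman residual of the estimate becomes more uniform across state-action pairs and vanishes when the estimate is exact.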