We study the interplay between the data distribution and Q-learning-based algorithms with function approximation. We provide a unified theoretical and empirical analysis of how different properties of the data distribution influence the performance of these algorithms, connecting different lines of research while validating and extending previous results. We start by reviewing theoretical bounds on the performance of approximate dynamic programming algorithms. We then introduce a novel four-state MDP specifically tailored to highlight the impact of the data distribution on the performance of Q-learning-based algorithms with function approximation, both online and offline. Finally, we experimentally assess the impact of data distribution properties on the performance of two offline Q-learning-based algorithms in different environments. According to our results: (i) high-entropy data distributions are well suited for offline learning; and (ii) a certain degree of data diversity (data coverage) and data quality (closeness to the optimal policy) are jointly desirable for offline learning.