In high-stakes applications, active experimentation may be considered too risky, so data are often collected passively. While in simple cases, such as bandits, passive and active data collection are similarly effective, the price of passive sampling can be much higher when collecting data from a system with controlled states. The main focus of this paper is the characterization of this price. For example, when learning in episodic finite state-action Markov decision processes (MDPs) with $\mathrm{S}$ states and $\mathrm{A}$ actions, we show that even with the best (but passively chosen) logging policy, $\Omega(\mathrm{A}^{\min(\mathrm{S}-1, H)}/\varepsilon^2)$ episodes are necessary (and sufficient) to obtain an $\varepsilon$-optimal policy, where $H$ is the length of episodes. This shows that the sample complexity blows up exponentially compared to the case of active data collection, a result that is not unexpected but, as far as we know, has not been published before; the exact form of the expression is perhaps a little surprising. We also extend these results in various directions, such as other criteria or learning in the presence of function approximation, with similar conclusions. A remarkable feature of our result is the sharp characterization of the exponent that appears, which is critical for understanding what makes passive learning hard.
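To give a feel for how the stated bound scales, the following is a minimal sketch that evaluates the order $\mathrm{A}^{\min(\mathrm{S}-1, H)}/\varepsilon^2$ for a few problem sizes. The function names `bound_exponent` and `passive_bound_order` are hypothetical, and the constant hidden by the $\Omega(\cdot)$ notation is set to 1 purely for illustration, so the returned values indicate growth rates rather than actual episode counts.

```python
def bound_exponent(num_states: int, horizon: int) -> int:
    """Exponent min(S - 1, H) appearing in the bound."""
    if num_states < 1 or horizon < 1:
        raise ValueError("num_states and horizon must be positive")
    return min(num_states - 1, horizon)


def passive_bound_order(num_states: int, num_actions: int,
                        horizon: int, eps: float) -> float:
    """Order A^{min(S-1, H)} / eps^2 of the episode count.

    Assumption: the constant hidden by Omega(.) is taken to be 1,
    so this is a scaling illustration, not a sample-size formula.
    """
    if num_actions < 1:
        raise ValueError("num_actions must be positive")
    if eps <= 0:
        raise ValueError("eps must be positive")
    exponent = bound_exponent(num_states, horizon)
    return num_actions ** exponent / eps ** 2


if __name__ == "__main__":
    # Fix A = 2, H = 10, eps = 0.5 and vary the number of states S.
    for s in (2, 5, 11, 20):
        order = passive_bound_order(s, 2, 10, 0.5)
        print(f"S={s:2d}  exponent={bound_exponent(s, 10):2d}  order={order:g}")
```

In this sketch the exponent grows with the number of states until $\mathrm{S}-1$ reaches $H$, after which the horizon caps it; this is the $\min(\mathrm{S}-1, H)$ behavior that the abstract highlights.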