This paper presents an inverse reinforcement learning (IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify whether these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality with respect to the observed strategies. Our IRL algorithm identifies optimality and then constructs set-valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search. As a real-world example, we use a YouTube dataset comprising metadata from 190,000 videos to illustrate how the proposed IRL method predicts user engagement in online multimedia platforms with high accuracy. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities.