We study the tradeoff between consistency and robustness in the context of a single-trajectory time-varying Markov Decision Process (MDP) with untrusted machine-learned advice. Our work departs from the typical approach of treating advice as coming from black-box sources by instead considering a setting where additional information about how the advice is generated is available. We prove a first-of-its-kind consistency and robustness tradeoff given Q-value advice under a general MDP model that includes both continuous and discrete state/action spaces. Our results highlight that utilizing Q-value advice enables dynamic pursuit of the better of machine-learned advice and a robust baseline, thus resulting in near-optimal performance guarantees that provably improve on what can be obtained solely with black-box advice.