We study the tradeoff between consistency and robustness in the context of a single-trajectory time-varying Markov Decision Process (MDP) with untrusted machine-learned advice. Our work departs from the typical approach of treating advice as coming from black-box sources by instead considering a setting where additional information about how the advice is generated is available. We prove a first-of-its-kind consistency and robustness tradeoff given Q-value advice under a general MDP model that includes both continuous and discrete state/action spaces. Our results highlight that utilizing Q-value advice enables dynamic pursuit of the better of machine-learned advice and a robust baseline, thus resulting in near-optimal performance guarantees that provably improve on what can be obtained solely with black-box advice.