Interaction-Grounded Learning (IGL) [Xie et al., 2021] is a powerful framework in which a learner aims to maximize unobservable rewards by interacting with an environment and observing reward-dependent feedback on the actions taken. To handle the personalized rewards that are ubiquitous in applications such as recommendation systems, Maghakian et al. [2022] study a version of IGL with context-dependent feedback, but their algorithm comes with no theoretical guarantees. In this work, we consider the same problem and provide the first provably efficient algorithms with sublinear regret under realizability. Our analysis reveals that the step-function estimator of prior work can deviate uncontrollably due to finite-sample effects. Our solution is a novel Lipschitz reward estimator that underestimates the true reward and enjoys favorable generalization performance. Building on this estimator, we propose two algorithms, one based on explore-then-exploit and the other based on inverse-gap weighting. We apply IGL to learning from image feedback and learning from text feedback, two reward-free settings that arise in practice. Experimental results showcase the importance of using our Lipschitz reward estimator and the overall effectiveness of our algorithms.
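The abstract does not spell out either construction, so the sketch below is illustrative only, not the paper's method. It shows (i) a generic way to relax a step-function reward decoder into a Lipschitz lower bound, and (ii) the standard inverse-gap-weighting exploration distribution (in the style of SquareCB) that the second algorithm builds on. The function names and the parameters `eps` and `gamma` are assumptions introduced here, not the paper's notation.

```python
import numpy as np

def lipschitz_underestimate(z: np.ndarray, theta: float, eps: float) -> np.ndarray:
    """Relax the step decoder 1[z >= theta] into a (1/eps)-Lipschitz ramp.

    The ramp never exceeds the step function, so it underestimates the
    decoded reward everywhere; a generic illustration, not the paper's
    exact estimator.
    """
    return np.clip((z - theta) / eps, 0.0, 1.0)

def inverse_gap_weighting(rhat: np.ndarray, gamma: float) -> np.ndarray:
    """Standard inverse-gap-weighted distribution over K actions.

    Each non-greedy action a gets probability 1 / (K + gamma * gap(a)),
    where gap(a) is its estimated reward gap to the greedy action; the
    greedy action absorbs the remaining probability mass.
    """
    K = len(rhat)
    best = int(np.argmax(rhat))
    probs = 1.0 / (K + gamma * (rhat[best] - rhat))
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()  # leftover mass goes to the greedy action
    return probs

# Usage: sample an action from the IGW distribution over 5 actions.
rng = np.random.default_rng(0)
p = inverse_gap_weighting(np.array([0.1, 0.4, 0.35, 0.2, 0.05]), gamma=50.0)
action = rng.choice(len(p), p=p)
```

As `gamma` grows, the distribution concentrates on the greedy action, trading exploration for exploitation; the underestimating ramp keeps the reward estimate conservative rather than letting finite-sample noise push a hard threshold estimator arbitrarily far off.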