We study contextual bandit (CB) problems in which the user can sometimes respond with the best action for a given context. Such interaction arises, for example, in text prediction or autocompletion, where a poor suggestion is simply ignored and the user enters the desired text instead. Crucially, this extra feedback is user-triggered and arrives on only a subset of the contexts. We develop a new framework that leverages such signals while remaining robust to their biased nature. We also augment standard CB algorithms to use the signal, and show improved regret guarantees for the resulting algorithms under a variety of conditions on the helpfulness of this feedback and the bias inherent in it.