Algorithmic decision-making in high-stakes domains often involves assigning decisions to agents with incentives to strategically modify their input to the algorithm. In addition to dealing with incentives, in many domains of interest (e.g., lending and hiring) the decision-maker only observes feedback regarding their policy for rounds in which they assign a positive decision to the agent; this type of feedback is often referred to as apple-tasting (or one-sided) feedback. We formalize this setting as an online learning problem with apple-tasting feedback, where a principal makes decisions about a sequence of $T$ agents, each represented by a context that may be strategically modified. Our goal is to achieve sublinear strategic regret, which compares the performance of the principal to that of the best fixed policy in hindsight had the agents revealed their contexts truthfully. Our main result is a learning algorithm which incurs $\tilde{\mathcal{O}}(\sqrt{T})$ strategic regret when the sequence of agents is chosen stochastically. We also give an algorithm capable of handling adversarially chosen agents, albeit at the cost of $\tilde{\mathcal{O}}(T^{(d+1)/(d+2)})$ strategic regret (where $d$ is the dimension of the context). Our algorithms can be easily adapted to the setting where the principal receives bandit feedback. This setting generalizes both the linear contextual bandit problem (by considering agents with incentives) and the strategic classification problem (by allowing for partial feedback).
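To make the interaction protocol concrete, the following is a minimal sketch of a single round with a strategic agent and apple-tasting feedback. It is an illustration under stated assumptions, not the algorithm from this work: the linear threshold policy, the Euclidean effort budget `delta`, the dependence of the reward on the agent's true context, and all function names (`best_response`, `decide`, `play_round`) are hypothetical choices made for the example.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def best_response(x_true, theta, threshold, delta):
    """Context the agent reports to the principal.

    ASSUMPTION (not from the source): the agent can shift its context by at
    most `delta` in Euclidean norm, and does so only if that is enough to
    reach the decision boundary <theta, x> = threshold.
    """
    norm = math.sqrt(dot(theta, theta))
    gap = threshold - dot(theta, x_true)
    if gap <= 0 or norm == 0 or gap / norm > delta:
        # Already accepted, or the boundary is out of reach: report truthfully.
        return list(x_true)
    # Cheapest move: step along theta until the score equals the threshold.
    step = gap / (norm * norm)
    return [xi + step * ti for xi, ti in zip(x_true, theta)]


def decide(x_reported, theta, threshold, tol=1e-9):
    """Hypothetical linear threshold policy; `tol` absorbs rounding error."""
    return dot(theta, x_reported) >= threshold - tol


def play_round(x_true, theta, threshold, delta, reward_fn):
    """One round of the protocol with apple-tasting (one-sided) feedback."""
    x_reported = best_response(x_true, theta, threshold, delta)
    accepted = decide(x_reported, theta, threshold)
    # Apple tasting: the reward is revealed only after a positive decision.
    # ASSUMPTION: the reward is a function of the true, unmodified context.
    feedback = reward_fn(x_true) if accepted else None
    return x_reported, accepted, feedback


if __name__ == "__main__":
    theta, threshold, delta = [1.0, 0.0], 1.0, 0.5
    reward_fn = lambda x: x[0] + x[1]  # hypothetical reward
    for x in ([0.7, 0.2], [0.2, 0.0], [1.5, 0.0]):
        print(x, play_round(x, theta, threshold, delta, reward_fn))
```

In this sketch the first agent games the policy by moving onto the boundary, the second cannot afford to and is rejected without any feedback, and the third is accepted without modification. The principal sees only the reported context and, for accepted agents, the reward, which is what makes learning from such rounds difficult.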