This paper studies Learning from Imperfect Human Feedback (LIHF), motivated by humans' potential irrationality or imperfect perception of their true preferences. We revisit the classic dueling bandit problem as a model of learning from comparative human feedback, and enrich it by casting the imperfection in human feedback as agnostic corruption to user utilities. We start by identifying the fundamental limits of LIHF and prove a regret lower bound of $\Omega(\max\{T^{1/2}, C\})$, even when the total corruption $C$ is known and when the corruption decays gracefully over time (i.e., user feedback becomes increasingly accurate). We then turn to designing robust algorithms for real-world scenarios with arbitrary corruption and unknown $C$. Our key finding is that gradient-based algorithms enjoy a smooth efficiency-robustness tradeoff under corruption, obtained by varying their learning rates. Specifically, under general concave user utility, the Dueling Bandit Gradient Descent (DBGD) algorithm of Yue and Joachims (2009) can be tuned to achieve regret $O(T^{1-\alpha} + T^{\alpha} C)$ for any given parameter $\alpha \in (0, \frac{1}{4}]$. This result also enables us to pin down, to the best of our knowledge for the first time, the regret lower bound of standard DBGD (the $\alpha = 1/4$ case) as $\Omega(T^{3/4})$. For strongly concave user utility we show a better tradeoff: there is an algorithm that achieves $O(T^{\alpha} + T^{\frac{1}{2}(1-\alpha)} C)$ for any given $\alpha \in [\frac{1}{2}, 1)$. Our theoretical insights are corroborated by extensive experiments on real-world recommendation data.
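To make the gradient-based approach concrete, here is a minimal Python sketch of DBGD in the spirit of Yue and Joachims (2009): at each round the learner perturbs its current point in a random direction, duels the perturbed candidate against the incumbent, and takes a small step toward the winner. The `duel` callback, the ball-shaped action set, and the step-size names `delta`/`gamma` are illustrative assumptions, not the paper's exact formulation; the abstract's efficiency-robustness tradeoff corresponds to shrinking these step sizes polynomially in $T$, with the exact schedules given in the paper.

```python
import numpy as np

def dbgd(duel, d, T, delta, gamma, radius=1.0, seed=0):
    """Sketch of Dueling Bandit Gradient Descent (DBGD).

    duel(w, w_probe) should return True when the (possibly corrupted)
    comparative feedback prefers w_probe over the incumbent w.
    The action set is assumed to be the Euclidean ball of `radius`.
    """
    rng = np.random.default_rng(seed)

    def project(w):
        # Euclidean projection onto the feasible ball.
        norm = np.linalg.norm(w)
        return w if norm <= radius else w * (radius / norm)

    w = np.zeros(d)  # incumbent point
    for _ in range(T):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)             # uniform random unit direction
        w_probe = project(w + delta * u)   # exploratory candidate
        if duel(w, w_probe):               # comparative (human) feedback
            w = project(w + gamma * u)     # step toward the winning direction
    return w

# Example: a noiseless duel induced by a concave utility f(w) = -||w - w*||^2.
if __name__ == "__main__":
    w_star = np.array([0.5, -0.3])
    f = lambda w: -np.sum((w - w_star) ** 2)
    T = 10_000
    # Placeholder tuning: polynomially-decaying step sizes; the exact
    # schedules for the alpha = 1/4 (standard DBGD) case are in the paper.
    w_hat = dbgd(lambda a, b: f(b) > f(a), d=2, T=T,
                 delta=T ** -0.25, gamma=T ** -0.25)
    print(w_hat)  # should approach w_star
```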