We study the well-studied dueling bandit problem, in which a learner aims to identify near-optimal actions from pairwise comparisons, under the constraint of differential privacy (DP). We consider a general class of utility-based preference matrices over large (potentially unbounded) decision spaces and give the first differentially private dueling bandit algorithm for active learning with user preferences. Our proposed algorithms are computationally efficient and achieve near-optimal performance in terms of both the private and non-private regret bounds. More precisely, when the decision space has finite size $K$, our algorithm achieves an order-optimal regret bound of $O\Big(\sum_{i = 2}^K\log\frac{KT}{\Delta_i} + \frac{K}{\epsilon}\Big)$ under pure $\epsilon$-DP, where $\Delta_i$ denotes the suboptimality gap of the $i$-th arm. We also present a matching lower bound, which establishes the optimality of our algorithms. Finally, we extend our results to general $d$-dimensional decision spaces with potentially infinitely many arms and design an $\epsilon$-DP algorithm with regret $\tilde{O} \left( \frac{d^6}{\kappa \epsilon } + \frac{ d\sqrt{T }}{\kappa} \right)$, providing privacy for free when $T \gg d$.
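To make the "privacy for free" claim concrete, the following short calculation compares the two terms of the $d$-dimensional bound at face value. It is only a sketch: it ignores the constants and logarithmic factors hidden by $\tilde{O}$, and the resulting threshold on $T$ is our reading of the stated bound, not a condition taken from the paper.

$$
\frac{d^6}{\kappa \epsilon} \le \frac{d\sqrt{T}}{\kappa}
\iff \frac{d^5}{\epsilon} \le \sqrt{T}
\iff T \ge \frac{d^{10}}{\epsilon^2}.
$$

Under these assumptions, once the horizon $T$ exceeds roughly $d^{10}/\epsilon^2$, the privacy term $\frac{d^6}{\kappa\epsilon}$ is dominated by the non-private term $\frac{d\sqrt{T}}{\kappa}$, so the overall regret matches the non-private rate up to constant and logarithmic factors.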