We study stochastic multi-armed bandits in which the objective is a statistical functional of the long-run reward distribution, rather than expected reward alone. Under mild continuity assumptions, we show that the infinite-horizon problem reduces to optimizing over stationary mixed policies: each weight vector \(w\) on the simplex induces a mixture law \(P^w\), and performance is measured by the concave utility \(U(w)=\mathfrak U(P^w)\). For differentiable statistical utilities, we use influence-function calculus to derive stochastic gradient estimators from bandit feedback. This leads to an entropic mirror-ascent algorithm on a truncated simplex, implemented through multiplicative-weights updates and plug-in estimates of the influence function. We establish regret bounds that separate the mirror-ascent optimization error from the bias caused by estimating the influence function. The framework is developed for general concave distributional utilities and illustrated through variance and Wasserstein objectives, with numerical experiments comparing exact and plug-in influence-function implementations.
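The algorithmic idea summarized above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation. It takes the variance of the mixture law as the concave utility, whose influence function at a law with mean \(\mu\) and variance \(\sigma^2\) is \((x-\mu)^2-\sigma^2\). The function names, the importance-weighted gradient estimate, the plug-in moments built from per-arm running averages, and the uniform-mixing step used to keep every weight at least \(\delta\) are all choices made for this example; the uniform mixing in particular stands in for an exact projection onto the truncated simplex.

```python
import numpy as np


def variance_influence(x, mean, var):
    """Influence function of the variance functional at a law with the given mean and variance."""
    return (x - mean) ** 2 - var


def entropic_mirror_ascent(sample_arm, n_arms, horizon, eta=0.05, delta=0.05, seed=0):
    """Hypothetical sketch: multiplicative-weights ascent on the variance of the mixture law.

    sample_arm(arm, rng) is an assumed callback returning one reward from the chosen arm.
    """
    if n_arms * delta >= 1.0:
        raise ValueError("delta must satisfy n_arms * delta < 1")
    rng = np.random.default_rng(seed)
    w = np.full(n_arms, 1.0 / n_arms)   # start at the uniform mixed policy
    counts = np.zeros(n_arms)           # pulls per arm
    sum_x = np.zeros(n_arms)            # running sum of rewards per arm
    sum_x2 = np.zeros(n_arms)           # running sum of squared rewards per arm

    for _ in range(horizon):
        arm = rng.choice(n_arms, p=w)
        x = sample_arm(arm, rng)
        counts[arm] += 1
        sum_x[arm] += x
        sum_x2[arm] += x * x

        # Plug-in moments of the mixture P^w from per-arm empirical moments
        # (assumption: arms not yet pulled contribute zero moments).
        m1 = sum_x / np.maximum(counts, 1)
        m2 = sum_x2 / np.maximum(counts, 1)
        mix_mean = w @ m1
        mix_var = max(w @ m2 - mix_mean ** 2, 0.0)

        # Importance-weighted gradient estimate: only the pulled arm is nonzero.
        grad = np.zeros(n_arms)
        grad[arm] = variance_influence(x, mix_mean, mix_var) / w[arm]

        # Entropic mirror-ascent step, i.e. a multiplicative-weights update.
        # Subtracting the maximum only rescales w before normalization.
        w = w * np.exp(eta * (grad - grad.max()))
        w /= w.sum()

        # Simplification: mix with the uniform law so every weight stays >= delta.
        w = (1.0 - n_arms * delta) * w + delta

    return w
```

As a usage example, with two arms where arm 0 is uniform on \([-1,1]\) and arm 1 always returns 0, calling `entropic_mirror_ascent(lambda a, rng: rng.uniform(-1.0, 1.0) if a == 0 else 0.0, n_arms=2, horizon=2000)` should shift weight toward arm 0, since that arm is the only source of variance in the mixture.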