We study stochastic multi-armed bandits in which the objective is a statistical functional of the long-run reward distribution, rather than expected reward alone. Under mild continuity assumptions, we show that the infinite-horizon problem reduces to optimizing over stationary mixed policies: each weight vector \(w\) on the simplex induces a mixture law \(P^w\), and performance is measured by the concave utility \(U(w)=\mathfrak U(P^w)\). For differentiable statistical utilities, we use influence-function calculus to derive stochastic gradient estimators from bandit feedback. This leads to an entropic mirror-ascent algorithm on a truncated simplex, implemented through multiplicative-weights updates and plug-in estimates of the influence function. We establish regret bounds that separate the mirror-ascent optimization error from the bias caused by estimating the influence function. The framework is developed for general concave distributional utilities and illustrated through variance and Wasserstein objectives, with numerical experiments comparing exact and plug-in influence-function implementations.
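As a rough illustration of the algorithm summarized in the abstract, the following sketch runs entropic mirror ascent with multiplicative-weights updates for the variance objective, using bandit feedback and a plug-in estimate of the mixture mean. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`variance_influence`, `mw_step`, `run`), the step size `eta`, the truncation level `gamma`, the importance-weighted gradient estimate, and the use of uniform mixing to stay inside the truncated simplex (in place of an exact Bregman projection) are all illustrative choices. For the variance, the first variation at \(P\) is \((x-\mu_P)^2\) up to an additive constant, and that constant does not affect updates on the simplex.

```python
import math
import random


def variance_influence(x, mean):
    """First variation of Var(P) at P, up to an additive constant: (x - mean)^2."""
    return (x - mean) ** 2


def mw_step(w, grad, eta, gamma):
    """One multiplicative-weights (entropic mirror-ascent) step.

    Assumption: after normalizing, we mix with the uniform distribution so that
    every weight is at least gamma. This stands in for an exact projection onto
    the truncated simplex and requires len(w) * gamma < 1.
    """
    k = len(w)
    logits = [math.log(wi) + eta * gi for wi, gi in zip(w, grad)]
    top = max(logits)  # subtract the max for numerical stability
    unnorm = [math.exp(v - top) for v in logits]
    total = sum(unnorm)
    p = [u / total for u in unnorm]
    return [(1.0 - k * gamma) * pi + gamma for pi in p]


def run(arms, horizon, eta=0.05, gamma=0.01, seed=0):
    """Bandit loop: sample an arm from w, observe a reward, update w.

    `arms` is a list of callables taking a random.Random and returning a reward
    (a hypothetical interface chosen for this sketch).
    """
    rng = random.Random(seed)
    k = len(arms)
    w = [1.0 / k] * k
    counts = [0] * k
    sums = [0.0] * k
    for _ in range(horizon):
        a = rng.choices(range(k), weights=w)[0]
        x = arms[a](rng)
        counts[a] += 1
        sums[a] += x
        # Plug-in estimate of the mean of the mixture P^w from per-arm sample means.
        means = [sums[j] / counts[j] if counts[j] else 0.0 for j in range(k)]
        mu = sum(wj * mj for wj, mj in zip(w, means))
        # Importance-weighted gradient estimate: only the pulled arm is nonzero.
        grad = [0.0] * k
        grad[a] = variance_influence(x, mu) / w[a]
        w = mw_step(w, grad, eta, gamma)
    return w


if __name__ == "__main__":
    # Two deterministic arms at 0 and 1: the mixture variance is w(1 - w).
    print(run([lambda r: 0.0, lambda r: 1.0], horizon=2000))
```

The truncation level keeps the importance weights `1 / w[a]` bounded by `1 / gamma`, which is one reason to work on a truncated simplex; the plug-in mean `mu` is where the estimation bias mentioned in the abstract would enter.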