We study stochastic multi-armed bandits in which the objective is a statistical functional of the long-run reward distribution, rather than expected reward alone. Under mild continuity assumptions, we show that the infinite-horizon problem reduces to optimizing over stationary mixed policies: each weight vector \(w\) on the simplex induces a mixture law \(P^w\), and performance is measured by the concave utility \(U(w)=\mathfrak U(P^w)\). For differentiable statistical utilities, we use influence-function calculus to derive stochastic gradient estimators from bandit feedback. This leads to an entropic mirror-ascent algorithm on a truncated simplex, implemented through multiplicative-weights updates and plug-in estimates of the influence function. We establish regret bounds that separate the mirror-ascent optimization error from the bias caused by estimating the influence function. The framework is developed for general concave distributional utilities and illustrated through variance and Wasserstein objectives, with numerical experiments comparing exact and plug-in influence-function implementations.
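As a rough illustration of the algorithmic idea in the abstract, the following sketch runs a multiplicative-weights (entropic mirror-ascent) update for the variance objective, using a plug-in estimate of the variance influence function \(\psi(x;P)=(x-\mu_P)^2-\sigma_P^2\). It is a sketch under stated assumptions, not the paper's algorithm: the names `mirror_ascent_variance` and `sample_arm`, the step size `eta`, the truncation level `gamma`, the importance-weighted gradient estimate, the running-moment plug-in, and the use of uniform mixing to keep weights inside a truncated simplex are all choices made here for illustration.

```python
import numpy as np


def variance_influence(x, mean, var):
    """Influence function of the variance functional: (x - mean)^2 - var."""
    return (x - mean) ** 2 - var


def mirror_ascent_variance(sample_arm, n_arms, horizon,
                           eta=0.01, gamma=0.05, seed=0):
    """Multiplicative-weights sketch for maximizing the mixture variance.

    sample_arm(arm, rng) -> float is a hypothetical reward oracle.
    Returns the time-averaged mixed policy played over the horizon.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_arms)    # log-weights of the multiplicative update
    counts = np.zeros(n_arms)   # number of pulls per arm
    sum1 = np.zeros(n_arms)     # per-arm running sum of rewards
    sum2 = np.zeros(n_arms)     # per-arm running sum of squared rewards
    w_sum = np.zeros(n_arms)

    for _ in range(horizon):
        # Softmax of the log-weights, then mix with the uniform law so that
        # every weight is at least gamma / n_arms (assumed truncation scheme).
        q = np.exp(theta - theta.max())
        q /= q.sum()
        w = (1.0 - gamma) * q + gamma / n_arms

        # Bandit feedback: pull one arm drawn from w, observe one reward.
        arm = rng.choice(n_arms, p=w)
        x = sample_arm(arm, rng)

        # Plug-in mean and variance of the mixture law P^w from past data.
        m1 = sum1 / np.maximum(counts, 1)
        m2 = sum2 / np.maximum(counts, 1)
        mix_mean = w @ m1
        mix_var = max(w @ m2 - mix_mean ** 2, 0.0)

        # Importance-weighted gradient estimate: only the pulled coordinate
        # is nonzero, with value psi_hat(x) / w[arm].
        psi_hat = variance_influence(x, mix_mean, mix_var)
        theta[arm] += eta * psi_hat / w[arm]

        # Update the running moments with the new observation.
        counts[arm] += 1
        sum1[arm] += x
        sum2[arm] += x * x
        w_sum += w

    return w_sum / horizon
```

For a quick check, one might call this with two Gaussian arms, say \(N(0,1)\) and \(N(3,0.1^2)\). The mixture variance \(w\cdot 1+(1-w)\cdot 0.01+9w(1-w)\) is then maximized at the interior weight \(w=9.99/18\approx 0.555\) on the first arm, so the averaged policy should settle on a genuine mixture rather than a single arm.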