Order Optimal Regret Bounds for Sharpe Ratio Optimization under Thompson Sampling

In this paper, we study sequential decision-making for maximizing the Sharpe ratio (SR) in a stochastic multi-armed bandit (MAB) setting. Unlike standard bandit formulations that maximize cumulative reward, SR optimization requires balancing expected return against reward variability. As a result, the learning objective depends jointly on the mean and variance of the reward distribution and takes a fractional form. To address this problem, we propose Sharpe Ratio Thompson Sampling (\texttt{SRTS}), a Bayesian algorithm for risk-adjusted exploration. For Gaussian reward models, the algorithm employs a Normal-Gamma conjugate posterior to capture uncertainty in both the mean and the precision of each arm. In contrast to additive mean-variance (MV) formulations, which often require different algorithms across risk regimes, the fractional SR objective yields a single sampling rule that applies uniformly across risk tolerances. On the theoretical side, we develop a regret decomposition tailored to the SR objective and introduce a decoupling approach that separates the contributions of mean and variance uncertainty. This framework allows us to control the interaction between the Gaussian mean samples and the Gamma precision samples drawn from the posterior. Using these results, we establish a finite-time, distribution-dependent $\mathcal{O}(\log n)$ upper bound on the expected regret. We further derive a matching information-theoretic lower bound using a change-of-measure argument, showing that the proposed algorithm is order-optimal. Finally, experiments on synthetic bandit environments illustrate the performance of \texttt{SRTS} and demonstrate improvements over existing risk-aware bandit algorithms across a range of risk-return settings.
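The sampling rule described above can be illustrated with a minimal Python sketch. It is not the paper's reference implementation. It assumes an unregularized Sharpe ratio (mean divided by standard deviation), a default Normal-Gamma prior with hyperparameters `mu0=0`, `kappa0=1`, `alpha0=1`, `beta0=1`, and hypothetical names (`NormalGammaArm`, `sample_sharpe`, `run_srts_sketch`) introduced only for illustration. The abstract does not specify the exact SR definition, the prior, or any initialization rounds used by \texttt{SRTS}.

```python
import numpy as np


class NormalGammaArm:
    """Normal-Gamma posterior over (mean, precision) of one Gaussian arm.

    Hyperparameter defaults are illustrative assumptions, not taken
    from the paper.
    """

    def __init__(self, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
        self.mu, self.kappa = mu0, kappa0
        self.alpha, self.beta = alpha0, beta0

    def update(self, x):
        # Standard single-observation conjugate update.
        kappa_new = self.kappa + 1.0
        self.beta += 0.5 * self.kappa * (x - self.mu) ** 2 / kappa_new
        self.mu = (self.kappa * self.mu + x) / kappa_new
        self.kappa = kappa_new
        self.alpha += 0.5

    def sample_sharpe(self, rng):
        # Draw precision tau ~ Gamma(alpha, rate=beta), then
        # mean mu ~ Normal(mu_n, 1 / (kappa_n * tau)).
        tau = rng.gamma(shape=self.alpha, scale=1.0 / self.beta)
        mu = rng.normal(self.mu, 1.0 / np.sqrt(self.kappa * tau))
        # Assumed SR form: mean / std = mu * sqrt(tau).
        return mu * np.sqrt(tau)


def run_srts_sketch(means, stds, horizon, seed=0):
    """Pull the arm whose sampled Sharpe ratio is largest each round."""
    rng = np.random.default_rng(seed)
    arms = [NormalGammaArm() for _ in means]
    pulls = np.zeros(len(means), dtype=int)
    for _ in range(horizon):
        k = int(np.argmax([arm.sample_sharpe(rng) for arm in arms]))
        reward = rng.normal(means[k], stds[k])  # synthetic Gaussian reward
        arms[k].update(reward)
        pulls[k] += 1
    return pulls
```

For instance, `run_srts_sketch([1.0, 0.2], [0.5, 1.0], horizon=2000)` simulates two arms with true Sharpe ratios 2.0 and 0.2 and returns the pull counts per arm. Because the sampled index is a single ratio, the same rule is used regardless of risk tolerance, which mirrors the contrast with additive MV formulations drawn in the abstract.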
