Reward Selection with Noisy Observations

We study a fundamental problem in optimization under uncertainty. There are $n$ boxes; each box $i$ contains a hidden reward $x_i$. Rewards are drawn i.i.d. from an unknown distribution $\mathcal{D}$. For each box $i$, we see $y_i$, an unbiased estimate of its reward, which is drawn from a Normal distribution with known standard deviation $\sigma_i$ (and an unknown mean $x_i$). Our task is to select a single box, with the goal of maximizing our reward. This problem captures a wide range of applications, e.g. ad auctions, where the hidden reward is the click-through rate of an ad. Previous work in this model [BKMR12] proves that the naive policy, which selects the box with the largest estimate $y_i$, is suboptimal, and suggests a linear policy, which selects the box $i$ with the largest $y_i - c \cdot \sigma_i$, for some $c > 0$. However, no formal guarantees are given about the performance of either policy (e.g., whether their expected reward is within some factor of the optimal policy's reward). In this work, we prove that both the naive policy and the linear policy are arbitrarily bad compared to the optimal policy, even when $\mathcal{D}$ is well-behaved, e.g. has monotone hazard rate (MHR), and even under a "small tail" condition, which requires that not too many boxes have arbitrarily large noise. On the flip side, we propose a simple threshold policy that gives a constant approximation to the reward of a prophet (who knows the realized values $x_1, \dots, x_n$) under the same "small tail" condition. We prove that when this condition is not satisfied, even an optimal clairvoyant policy (that knows $\mathcal{D}$) cannot get a constant approximation to the prophet, even for MHR distributions, implying that our threshold policy is optimal against the prophet benchmark, up to constants.
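To make the model concrete, the following is a minimal simulation sketch of the setting described above, not the paper's own experiments. It assumes $\mathcal{D}$ is the unit-rate exponential distribution (chosen only because it is MHR), and the function name `simulate`, the default value of `c`, and the example noise levels are all hypothetical. It compares the naive policy and the linear policy against the prophet benchmark $\max_i x_i$. The threshold policy is not sketched, since its definition is not given here.

```python
import random


def simulate(sigmas, c=1.0, trials=20000, seed=0):
    """Monte Carlo estimate of expected reward for the naive policy,
    the linear policy, and the prophet benchmark.

    sigmas: known noise standard deviations, one per box.
    c:      penalty coefficient of the linear policy (assumed value).
    """
    rng = random.Random(seed)
    n = len(sigmas)
    totals = {"naive": 0.0, "linear": 0.0, "prophet": 0.0}

    for _ in range(trials):
        # Hidden rewards x_i, i.i.d. from D. Assumption: D = Exp(1), an MHR distribution.
        x = [rng.expovariate(1.0) for _ in range(n)]
        # Noisy unbiased observations y_i ~ Normal(x_i, sigma_i).
        y = [rng.gauss(x[i], sigmas[i]) for i in range(n)]

        # Naive policy: pick the box with the largest observation y_i.
        naive_pick = max(range(n), key=lambda i: y[i])
        # Linear policy: pick the box with the largest y_i - c * sigma_i.
        linear_pick = max(range(n), key=lambda i: y[i] - c * sigmas[i])

        totals["naive"] += x[naive_pick]
        totals["linear"] += x[linear_pick]
        # Prophet: sees the realized rewards and takes the best one.
        totals["prophet"] += max(x)

    return {name: total / trials for name, total in totals.items()}


if __name__ == "__main__":
    # Hypothetical instance: five low-noise boxes and five high-noise boxes.
    example_sigmas = [0.1] * 5 + [10.0] * 5
    print(simulate(example_sigmas, c=1.0, trials=5000))
```

Varying `sigmas` and `c` in this sketch gives an informal feel for how the two policies respond to heterogeneous noise. It does not reproduce the worst-case instances behind the lower bounds.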
