Sequential learning problems are common in several fields of research and practical applications. Examples include dynamic pricing and assortment, the design of auctions and incentives, and a large number of sequential treatment experiments. In this paper, we extend one of the most popular learning solutions, the $\epsilon_t$-greedy heuristic, to high-dimensional contexts under a conservative directive. We do this by allocating part of the time that the original rule spends adopting completely new actions to a more focused search within a restricted set of promising actions. The resulting rule may be useful for practical applications that still value surprises, although at a decreasing rate, while also restricting the adoption of unusual actions. With high probability, we obtain reasonable bounds on the cumulative regret of a conservative high-dimensional decaying $\epsilon_t$-greedy rule. We also provide a lower bound on the cardinality of the set of viable actions that implies an improved regret bound for the conservative version compared with its non-conservative counterpart. Additionally, we show that end-users have sufficient flexibility in choosing how much safety they want, since it can be tuned without affecting the theoretical properties. We illustrate our proposal both in a simulation exercise and on a real dataset.
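The exploration split described above can be illustrated with a minimal sketch. This is a toy multi-armed-bandit version (the paper works in high-dimensional contextual settings); the function names, the decay schedule $\epsilon_t = \min(1, 1/\sqrt{t})$, and the parameters `safe_share` and `top_k` are illustrative assumptions, not the authors' exact rule.

```python
import random

def conservative_eps_greedy(n_arms, pull, T, safe_share=0.5, top_k=2, seed=0):
    """Decaying epsilon_t-greedy with a conservative exploration component.

    Toy sketch: at each round, with probability eps_t we explore; a share
    `safe_share` of the exploration budget is spent on a restricted set of
    promising (top-k by estimated mean) arms, and the rest on fully random
    "surprise" draws, whose rate decays with t.
    """
    rng = random.Random(seed)
    counts = [0] * n_arms
    means = [0.0] * n_arms   # running reward estimates
    total = 0.0
    for t in range(1, T + 1):
        eps_t = min(1.0, 1.0 / t ** 0.5)  # decaying exploration rate (assumed schedule)
        if rng.random() < eps_t:
            if rng.random() < safe_share:
                # conservative exploration: draw only from promising arms
                ranked = sorted(range(n_arms), key=lambda a: means[a], reverse=True)
                a = rng.choice(ranked[:top_k])
            else:
                # unrestricted exploration: completely new/unusual actions
                a = rng.randrange(n_arms)
        else:
            # greedy exploitation of the current best estimate
            a = max(range(n_arms), key=lambda a: means[a])
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
        total += r
    return means, total
```

Setting `safe_share = 0` recovers the ordinary decaying $\epsilon_t$-greedy rule, which is one way to see the tunable safety level mentioned in the abstract.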