Building a dataset requires investing time, money, and energy either to aggregate more data or to improve its quality. The most common practice favors quantity over quality without necessarily quantifying the trade-off that emerges. In this work, we study data-driven contextual decision-making and the performance implications of the quality and quantity of data. We focus on contextual decision-making with a Newsvendor loss. This loss is that of a central capacity planning problem in Operations Research, and it is also the loss associated with quantile regression. We consider a model in which outcomes observed in similar contexts have similar distributions, and we analyze the performance of a classical class of kernel policies that weigh data according to their similarity in a contextual space. We develop a series of results that lead to an exact characterization of the worst-case expected regret of these policies. This exact characterization applies to any sample size and any observed contexts. The model we develop is flexible and captures the case of partially observed contexts. This exact analysis unveils new structural insights on the learning behavior of uniform kernel methods: (i) the specialized analysis leads to very large improvements in the quantification of performance compared to state-of-the-art general-purpose bounds; (ii) we show an important non-monotonicity of the performance as a function of data size that is not captured by previous bounds; and (iii) we show that in some regimes, a small increase in the quality of the data can dramatically reduce the number of samples required to reach a performance target. All in all, our work demonstrates that it is possible to precisely quantify the interplay of data quality, data quantity, and performance in a central problem class. It also highlights the need for problem-specific bounds in order to understand the trade-offs at play.
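To make the setting concrete, the following is a minimal sketch of a uniform kernel policy under a Newsvendor loss. It is an illustration under stated assumptions, not the construction analyzed in this work: contexts are taken to be scalars, similarity is taken to be absolute distance, and the names `newsvendor_loss`, `uniform_kernel_order`, `bandwidth`, `underage_cost`, and `overage_cost` are hypothetical. The policy gives equal weight to every past sample whose context lies within the bandwidth and orders the empirical critical-ratio quantile of those demands, which minimizes the empirical Newsvendor loss on the selected samples.

```python
import math


def newsvendor_loss(order, demand, underage_cost, overage_cost):
    """Newsvendor loss: pay underage_cost per unit short, overage_cost per unit left over."""
    shortage = max(demand - order, 0.0)
    leftover = max(order - demand, 0.0)
    return underage_cost * shortage + overage_cost * leftover


def uniform_kernel_order(context, contexts, demands, bandwidth,
                         underage_cost, overage_cost):
    """Order quantity from a uniform kernel over scalar contexts (illustrative assumption).

    Samples with |x_i - context| <= bandwidth get equal weight; all others get zero.
    """
    nearby = sorted(d for x, d in zip(contexts, demands)
                    if abs(x - context) <= bandwidth)
    if not nearby:
        raise ValueError("no samples within the bandwidth")
    # Critical ratio: the quantile level that minimizes the Newsvendor loss.
    critical_ratio = underage_cost / (underage_cost + overage_cost)
    # Smallest order statistic whose empirical CDF reaches the critical ratio.
    k = max(math.ceil(critical_ratio * len(nearby)), 1)
    return nearby[k - 1]


# Made-up data: three samples near the query context and one far away.
contexts = [0.0, 0.1, 0.2, 0.9]
demands = [10.0, 12.0, 14.0, 100.0]
order = uniform_kernel_order(0.1, contexts, demands, bandwidth=0.15,
                             underage_cost=3.0, overage_cost=1.0)
```

In this sketch the bandwidth controls how many samples are used and how dissimilar their contexts may be. A larger bandwidth uses more samples, but those samples may come from contexts further away from the query.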