Contextual linear optimization (CLO) uses predictive contextual features to reduce uncertainty in random cost coefficients and thereby improve average-cost performance. An example is the stochastic shortest path problem with random edge costs (e.g., traffic) and contextual features (e.g., lagged traffic, weather). Existing work on CLO assumes the data has fully observed cost coefficient vectors, but in many applications we can only observe the realized cost of a historical decision, that is, a single projection of the random cost coefficient vector, which we refer to as bandit feedback. We study a class of offline learning algorithms for CLO with bandit feedback, which we term induced empirical risk minimization (IERM), where we fit a predictive model to directly optimize the downstream performance of the policy it induces. We show a fast-rate regret bound for IERM that allows for misspecified model classes and flexible choices of the optimization estimate, and we develop computationally tractable surrogate losses. A byproduct of our theory, of independent interest, is a fast-rate regret bound for IERM with full feedback and a misspecified policy class. We compare the performance of different modeling choices numerically using a stochastic shortest path example and provide practical insights from the empirical results.