We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework achieves near-optimal regret for contextual bandits with large action spaces while making only $O(\log T)$ calls to an offline regression oracle over $T$ rounds, and $O(\log\log T)$ calls when $T$ is known in advance. The design of OE2D generalizes FALCON~\citep{simchi2022bypassing} and its linear-reward variant~\citep[][Section 4]{xu2020upper}: it selects an action distribution, which we term an ``exploitative F-design,'' that simultaneously guarantees low regret and good coverage, thereby trading off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded both under bounded per-context Eluder dimension and in smoothed regret settings. We also establish a relationship between the DOEC and the Decision-Estimation Coefficient (DEC)~\citep{foster2021statistical}, bridging for the first time the design principles of offline- and online-oracle-efficient contextual bandit algorithms.
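For intuition, here is a minimal sketch of the inverse-gap-weighting rule underlying FALCON~\citep{simchi2022bypassing}, which the exploitative F-design generalizes; the learning rate $\gamma_m$ and the epoch schedules below are illustrative assumptions rather than the OE2D specification:
\[
p_m(a \mid x) =
\begin{cases}
\dfrac{1}{K + \gamma_m\bigl(\hat{f}_m(x,\hat{a}_x) - \hat{f}_m(x,a)\bigr)}, & a \neq \hat{a}_x,\\[1.5ex]
1 - \sum_{a' \neq \hat{a}_x} p_m(a' \mid x), & a = \hat{a}_x,
\end{cases}
\]
where $K$ is the number of actions, $\hat{f}_m$ is the estimate returned by the offline regression oracle at epoch $m$, and $\hat{a}_x = \arg\max_{a} \hat{f}_m(x,a)$ is the greedy action. Refitting $\hat{f}_m$ only at geometric epoch boundaries such as $\tau_m = 2^m$ gives $O(\log T)$ oracle calls, while a doubling-exponent schedule such as $\tau_m = \lceil T^{1-2^{-m}} \rceil$, usable when $T$ is known, gives $O(\log\log T)$ calls.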