The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits in which the current reward depends on at most $s$ prior actions and contexts (not necessarily consecutive), up to a time horizon of $h$. To avoid polynomial dependence on $h$, we propose new algorithms that leverage sparsity to jointly discover the dependence pattern and the arm parameters. We consider both the data-poor ($T<h$) and data-rich ($T\ge h$) regimes and derive regret upper bounds of $\tilde O(d\sqrt{sT} +\min\{ q, T\})$ and $\tilde O(\sqrt{sdT})$, respectively, where $s$ is the sparsity, $d$ the feature dimension, $T$ the total time horizon, and $q$ a quantity that adapts to the reward dependence pattern. Complementing these upper bounds, we also show that learning over a single trajectory brings inherent challenges: while the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds, and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis that addresses long-range temporal dependencies across data and avoids polynomial dependence on the reward horizon $h$. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
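To make the setting concrete, the following is a minimal sketch of a reward model with sparse long-range dependence. The abstract does not give the exact model, so the form used here is an assumption: each reward is a weighted sum of $\langle\theta, x_{t-k}\rangle$ over a small set of lags $k < h$, plus Gaussian noise. The names `pattern`, `theta`, `rank_one_parameter`, and `simulate_rewards` are hypothetical and chosen only for illustration.

```python
import random


def rank_one_parameter(pattern, theta, h):
    """Return the h x d matrix M with M[k][j] = pattern[k] * theta[j].

    M is the outer product of a sparse lag-weight vector and theta, so it
    has rank at most 1 and at most len(pattern) nonzero rows.
    """
    return [[pattern.get(k, 0.0) * th for th in theta] for k in range(h)]


def simulate_rewards(pattern, theta, T, noise_std=0.1, seed=0):
    """Simulate T rewards under the assumed sparse-dependence model.

    Assumption (not taken from the paper):
        r_t = sum_k pattern[k] * <theta, x_{t-k}> + Gaussian noise,
    where x_t is the feature vector of the action taken at step t.
    """
    rng = random.Random(seed)
    d = len(theta)
    # Random feature vectors stand in for the chosen actions and contexts.
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
    rewards = []
    for t in range(T):
        # Lags that reach before the start of the trajectory contribute nothing.
        mean = sum(
            w * sum(a * b for a, b in zip(theta, xs[t - k]))
            for k, w in pattern.items()
            if t - k >= 0
        )
        rewards.append(mean + rng.gauss(0.0, noise_std))
    return xs, rewards


# s = 3 active lags within a horizon of h = 50, feature dimension d = 4.
pattern = {1: 1.0, 7: -0.5, 40: 0.25}
theta = [0.6, -0.8, 0.0, 0.3]
M = rank_one_parameter(pattern, theta, h=50)
xs, rewards = simulate_rewards(pattern, theta, T=30)
```

In this example $T = 30 < h = 50$, so the simulated trajectory never reaches the lag-40 term. This illustrates, within the assumed model only, how a trajectory shorter than the horizon can leave part of the dependence pattern unobserved.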