We consider the sequential decision-making problem where the mean outcome is a non-linear function of the chosen action. Compared with the linear model, two curious phenomena arise in non-linear models: first, in addition to the "learning phase" with a standard parametric rate for estimation or regret, there is a "burn-in period" with a fixed cost determined by the non-linear function; second, achieving the smallest burn-in cost requires new exploration algorithms. For a special family of non-linear functions, known in the literature as ridge functions, we derive upper and lower bounds on the optimal burn-in cost and, via differential equations, on the entire learning trajectory during the burn-in period. In particular, a two-stage algorithm that first finds a good initial action and then treats the problem as locally linear is statistically optimal. In contrast, several classical algorithms, such as UCB and algorithms relying on regression oracles, are provably suboptimal.
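As a sketch of the setting, a ridge function depends on the action only through a one-dimensional projection. The symbols here are illustrative assumptions, not fixed by the text above: $\theta^\star$ is the unknown parameter, $f$ the known non-linear link, $a_t$ the action at round $t$, and $z_t$ zero-mean noise.

$$
r_t = f\big(\langle \theta^\star, a_t \rangle\big) + z_t, \qquad t = 1, \dots, T .
$$

Under this assumed notation, the linear model is the special case $f(x) = x$, and one common way to write the cumulative regret is

$$
R_T = \sum_{t=1}^{T} \Big( \max_{a} f\big(\langle \theta^\star, a \rangle\big) - f\big(\langle \theta^\star, a_t \rangle\big) \Big).
$$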