We consider the sequential decision-making problem where the mean outcome is a non-linear function of the chosen action. Compared with the linear model, two curious phenomena arise in non-linear models: first, in addition to the "learning phase" with a standard parametric rate for estimation or regret, there is a "burn-in period" with a fixed cost determined by the non-linear function; second, achieving the smallest burn-in cost requires new exploration algorithms. For a special family of non-linear functions known in the literature as ridge functions, we derive upper and lower bounds on the optimal burn-in cost and, in addition, on the entire learning trajectory during the burn-in period via differential equations. In particular, a two-stage algorithm that first finds a good initial action and then treats the problem as locally linear is statistically optimal. In contrast, several classical algorithms, such as UCB and algorithms relying on regression oracles, are provably suboptimal.
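As a reading aid, the ridge-function setting and the "locally linear" step can be written out as follows. This is only a sketch in notation assumed here, not taken from the abstract: $\theta^\star$ is the unknown parameter, $a_t$ the action at round $t$, $r_t$ the observed outcome, $f$ the known non-linear link, $\varepsilon_t$ zero-mean noise, and $a_0$ the initial action found in the first stage.

```latex
% Ridge-function model (assumed notation): the mean outcome depends on the
% action a_t only through its inner product with the unknown \theta^\star.
\[
  r_t = f\bigl(\langle \theta^\star, a_t \rangle\bigr) + \varepsilon_t,
  \qquad \mathbb{E}[\varepsilon_t \mid a_t] = 0 .
\]
% Local linearization around an initial action a_0: a first-order Taylor
% expansion of f at \langle \theta^\star, a_0 \rangle.
\[
  f\bigl(\langle \theta^\star, a \rangle\bigr)
  \approx f\bigl(\langle \theta^\star, a_0 \rangle\bigr)
  + f'\bigl(\langle \theta^\star, a_0 \rangle\bigr)\,
    \langle \theta^\star, a - a_0 \rangle .
\]
```

Under this reading, once $a_0$ is good enough for the first-order term to dominate, the mean outcome is approximately linear in $a$ near $a_0$, which is presumably what allows the second stage to treat the problem as locally linear.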