With the rapid development of big data, learning an optimal decision rule by updating it recursively and making decisions online has become easier than ever. We study online statistical inference of model parameters in a contextual bandit framework of sequential decision-making. We propose a general framework for online and adaptive data collection environments in which decision rules are updated via weighted stochastic gradient descent (SGD). We allow different weighting schemes for the stochastic gradient and establish the asymptotic normality of the resulting parameter estimator. Our proposed estimator significantly improves asymptotic efficiency over the previous averaged SGD approach via inverse-probability weights. We also analyze the optimality of the weights in a linear regression setting. Finally, we provide a Bahadur representation of the proposed estimator and show that the remainder term in this representation converges more slowly than in classical SGD due to the adaptive data collection.
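As a toy illustration of the ingredients described above, the following is a minimal sketch of inverse-probability-weighted SGD with online (Polyak-Ruppert) averaging in an adaptively collected two-armed linear bandit stream. This is not the paper's exact algorithm: the epsilon-greedy decision rule, the step-size schedule, and all variable names are our own assumptions for the sketch.

```python
# Illustrative sketch (hypothetical setup): inverse-probability-weighted SGD
# for a two-armed linear contextual bandit with an epsilon-greedy decision rule.
import numpy as np

rng = np.random.default_rng(0)
d, T, eps = 3, 20000, 0.5                 # feature dim, horizon, exploration rate
theta_true = np.array([1.0, -2.0, 0.5])   # ground-truth regression parameter
theta = np.zeros(d)                       # running SGD iterate
theta_bar = np.zeros(d)                   # Polyak-Ruppert online average

for t in range(1, T + 1):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)                            # context on the unit sphere
    greedy = 1.0 if x @ theta >= 0 else -1.0          # greedy action in {-1, +1}
    a = greedy if rng.random() > eps else rng.choice([-1.0, 1.0])
    pi = (1 - eps + eps / 2) if a == greedy else eps / 2  # propensity of chosen arm
    y = a * (x @ theta_true) + rng.normal()           # observed reward
    z = a * x                                         # effective regressor
    grad = (z @ theta - y) * z                        # least-squares gradient
    lr = 0.3 * t ** (-0.667)                          # step size gamma_t = c * t^(-alpha)
    theta = theta - lr * (1.0 / pi) * grad            # inverse-probability-weighted step
    theta_bar += (theta - theta_bar) / t              # online averaging

print(np.round(theta_bar, 2))
```

The inverse-probability weight `1 / pi` corrects for the fact that the epsilon-greedy rule samples actions non-uniformly and adaptively; the averaged iterate `theta_bar` is the kind of quantity whose asymptotic normality the paper studies.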