Big data analytics has opened new avenues in economic research, but analyzing datasets with tens of millions of observations remains a substantial challenge. Conventional econometric methods based on extremum estimators require large amounts of computing resources and memory, which are often not readily available. In this paper, we focus on linear quantile regression applied to ``ultra-large'' datasets, such as U.S. decennial censuses. We present a fast inference framework based on stochastic sub-gradient descent (S-subGD) updates. The inference procedure handles cross-sectional data sequentially: (i) updating the parameter estimate with each incoming ``new observation'', (ii) aggregating the updates as a Polyak-Ruppert average, and (iii) computing a pivotal statistic for inference using only the solution path. The methodology draws on time series regression to construct an asymptotically pivotal statistic through random scaling. Our proposed test statistic is computed in a fully online fashion, and critical values are obtained without resampling. We conduct extensive numerical studies to showcase the computational merits of our proposed inference. For inference problems as large as $(n, d) \sim (10^7, 10^3)$, where $n$ is the sample size and $d$ is the number of regressors, our method generates new insights and is computationally faster than current inference methods. In particular, our method reveals trends in the gender gap in the U.S. college wage premium using millions of observations, while controlling for over $10^3$ covariates to mitigate confounding effects.
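
The three steps of the procedure can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `ssubgd_quantile` and `random_scaling_ci`, the learning-rate schedule $\gamma_i = \gamma_0 i^{-a}$ with the defaults shown, and the zero starting value are all assumptions made for illustration. The sketch tracks only the diagonal of the random scaling matrix, which is enough for coefficient-by-coefficient intervals and keeps memory at order $d$. The default critical value of about 6.75 is our recollection of the tabulated 97.5% quantile of the limiting mixed normal distribution and should be checked against a published table before use.

```python
import numpy as np


def ssubgd_quantile(X, y, tau=0.5, gamma0=1.0, a=0.501):
    """One pass of S-subGD for linear quantile regression (illustrative sketch).

    Returns the Polyak-Ruppert average and the diagonal of the random
    scaling matrix V_n = n^{-2} * sum_s s^2 (bar_s - bar_n)(bar_s - bar_n)'.
    gamma0 and a are assumed tuning constants, not values from the source.
    """
    n, d = X.shape
    beta = np.zeros(d)       # current S-subGD iterate (assumed start at zero)
    beta_bar = np.zeros(d)   # running Polyak-Ruppert average
    # Running sums that allow V_n to be formed without storing the path.
    sum_s2_bar2 = np.zeros(d)  # sum_s s^2 * bar_s^2 (elementwise)
    sum_s2_bar = np.zeros(d)   # sum_s s^2 * bar_s
    sum_s2 = 0.0               # sum_s s^2
    for i in range(1, n + 1):
        x_i, y_i = X[i - 1], y[i - 1]
        # Sub-gradient of the check loss at the current iterate.
        grad = x_i * (float(y_i <= x_i @ beta) - tau)
        # (i) update the estimate with the new observation.
        beta = beta - gamma0 * i ** (-a) * grad
        # (ii) aggregate as a running average.
        beta_bar += (beta - beta_bar) / i
        # (iii) accumulate the pieces of the random scaling quantity.
        w = float(i) ** 2
        sum_s2_bar2 += w * beta_bar ** 2
        sum_s2_bar += w * beta_bar
        sum_s2 += w
    v_diag = (sum_s2_bar2 - 2.0 * sum_s2_bar * beta_bar
              + sum_s2 * beta_bar ** 2) / n ** 2
    return beta_bar, v_diag


def random_scaling_ci(beta_bar, v_diag, n, crit=6.75):
    """Two-sided interval beta_bar_j +/- crit * sqrt(V_jj / n).

    crit is an assumed approximate 97.5% critical value; verify before use.
    """
    half_width = crit * np.sqrt(v_diag / n)
    return beta_bar - half_width, beta_bar + half_width
```

Because each observation is visited once and only a few length-$d$ running sums are kept, the data can be streamed from disk rather than held in memory, and no resampling step is needed to form the intervals.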