Big data analytics has opened new avenues in economic research, but the challenge of analyzing datasets with tens of millions of observations is substantial. Conventional econometric methods based on extremum estimators require large amounts of computing resources and memory, which are often not readily available. In this paper, we focus on linear quantile regression applied to "ultra-large" datasets, such as U.S. decennial censuses. We present a fast inference framework that uses stochastic subgradient descent (S-subGD) updates. The inference procedure handles cross-sectional data sequentially: (i) updating the parameter estimate with each incoming "new observation", (ii) aggregating the updates as a $\textit{Polyak-Ruppert}$ average, and (iii) computing a pivotal statistic for inference using only the solution path. The methodology draws on time-series regression to construct an asymptotically pivotal statistic through random scaling. Our proposed test statistic is computed in a fully online fashion, and critical values are obtained without resampling. We conduct extensive numerical studies to showcase the computational merits of our proposed inference. For inference problems as large as $(n, d) \sim (10^7, 10^3)$, where $n$ is the sample size and $d$ is the number of regressors, our method is computationally faster than current inference methods and generates new insights. In particular, our method reveals trends in the gender gap in the U.S. college wage premium using millions of observations, while controlling for more than $10^3$ covariates to mitigate confounding effects.
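The three steps of the sequential procedure, (i) the S-subGD update, (ii) the Polyak-Ruppert average, and (iii) a statistic built from the solution path alone, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `ssubgd_quantile` and `random_scaling_ci`, the step-size schedule $\gamma_i = \gamma_0 i^{-a}$ with its default values, and the simulated data are all assumptions made for illustration. The sketch further assumes that the random-scaling matrix takes the form $\hat{V}_n = n^{-2} \sum_{s=1}^{n} s^2 (\bar{\beta}_s - \bar{\beta}_n)(\bar{\beta}_s - \bar{\beta}_n)'$ and tracks only its diagonal. The critical value `6.747` is an assumed two-sided 95% value for the resulting limiting distribution and should be checked against a published table before any real use.

```python
import numpy as np


def ssubgd_quantile(X, y, tau=0.5, gamma0=1.0, a=0.501):
    """One pass of S-subGD for linear quantile regression (illustrative sketch).

    Returns the Polyak-Ruppert average and the diagonal of an assumed
    random-scaling matrix, both updated online without storing the path.
    gamma0 and a are hypothetical tuning values, not taken from the paper.
    """
    n, d = X.shape
    beta = np.zeros(d)       # current iterate beta_i
    beta_bar = np.zeros(d)   # running Polyak-Ruppert average
    a_sum = np.zeros(d)      # running sum of s^2 * beta_bar_s^2 (elementwise)
    b_sum = np.zeros(d)      # running sum of s^2 * beta_bar_s
    c_sum = 0.0              # running sum of s^2

    for i in range(1, n + 1):
        x_i, y_i = X[i - 1], y[i - 1]
        # (i) subgradient of the check loss at the current iterate
        subgrad = x_i * (float(y_i <= x_i @ beta) - tau)
        beta = beta - gamma0 * i ** (-a) * subgrad
        # (ii) recursive update of the average
        beta_bar += (beta - beta_bar) / i
        # (iii) sufficient statistics for the random-scaling diagonal
        s2 = float(i) ** 2
        a_sum += s2 * beta_bar ** 2
        b_sum += s2 * beta_bar
        c_sum += s2

    # Expand sum_s s^2 (beta_bar_s - beta_bar_n)^2 using the running sums
    v_diag = (a_sum - 2.0 * b_sum * beta_bar + c_sum * beta_bar ** 2) / n ** 2
    v_diag = np.maximum(v_diag, 0.0)  # guard against rounding below zero
    return beta_bar, v_diag


def random_scaling_ci(beta_bar, v_diag, n, crit=6.747):
    """Per-coefficient interval; crit is an assumed 95% critical value."""
    half = crit * np.sqrt(v_diag / n)
    return beta_bar - half, beta_bar + half


# Simulated median regression (hypothetical data, not from the paper)
rng = np.random.default_rng(0)
n, d = 20_000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(size=n)

beta_bar, v_diag = ssubgd_quantile(X, y, tau=0.5)
lo, hi = random_scaling_ci(beta_bar, v_diag, n)
print("estimate:", beta_bar)
print("interval:", np.column_stack([lo, hi]))
```

Because the sketch keeps only running sums, memory use is of order $d$ regardless of $n$, which is the property that makes a fully online statistic attractive at this scale. Tracking the full $d \times d$ matrix instead of its diagonal would require storing $d^2$ running terms.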