Moment-Based Inference for Regression with Latent Dirichlet Covariates

Topic models are often used as dimension-reduction tools before regression, with estimated document-level topic shares treated as observed covariates. This plug-in workflow creates two inferential difficulties: valid inference requires a regular first-stage-to-second-stage expansion that propagates topic-estimation uncertainty, and, at fixed document length, a document's topic mixture cannot be consistently recovered from its own words even when the population topic matrix is known. Corrected spectral moment methods for latent Dirichlet allocation (LDA) offer a starting point: when the total Dirichlet concentration is known, low-order word moments can be corrected to yield operators diagonal in the latent topic basis. We extend this to downstream regression. Under a finite LDA model with response residuals orthogonal to the low-order token moments used for identification, response-weighted word moments admit the same correction, and the resulting supervised operator identifies the regression coefficient $β$ directly, without estimating document-level topic shares. The main obstacle is that the correction depends on the unknown total concentration $α_0$. We show that, for $k\ge3$ topics and under a generic finite-probe condition, $α_0$ is identified by commutativity: at the true value a family of corrected word-moment operators commute, whereas away from it they generically do not. This yields a feasible estimator and lets uncertainty in $\hatα_0$ propagate into inference for $β$. The estimator is asymptotically linear as the number of documents grows with fixed document length, with sandwich standard errors from document-level moment contributions. Simulations show near-nominal coverage where plug-in topic-share regressions can undercover, and an application to top economics journals illustrates contrast inference for latent topic effects.
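To make the known-concentration correction concrete, the following sketch checks it numerically at the population level for second-order moments only. It is an illustration under stated assumptions, not the estimator proposed here: it uses the standard correction that subtracts $α_0/(α_0+1)$ times the outer product of the mean word vector from the pairwise word moment, and it does not implement the commutativity criterion or the response-weighted operator. The topic matrix, the Dirichlet parameter, and all function names are hypothetical choices made for the example.

```python
import numpy as np


def population_moments(topics, alpha):
    """Exact first moment and pairwise cross-moment of two distinct
    one-hot tokens in a document under LDA.

    topics: (vocab, k) matrix whose columns are topic-word distributions.
    alpha:  (k,) Dirichlet parameter of the document-level topic shares.
    """
    alpha0 = alpha.sum()
    # E[h h^T] for h ~ Dirichlet(alpha)
    second_h = (np.diag(alpha) + np.outer(alpha, alpha)) / (alpha0 * (alpha0 + 1.0))
    m1 = topics @ (alpha / alpha0)
    pairs = topics @ second_h @ topics.T
    return m1, pairs


def corrected_second_moment(pairs, m1, alpha0_trial):
    """Pairwise word moment with the Dirichlet correction at a trial alpha0."""
    return pairs - alpha0_trial / (alpha0_trial + 1.0) * np.outer(m1, m1)


def offdiag_norm(a):
    """Frobenius norm of the off-diagonal part of a square matrix."""
    return np.linalg.norm(a - np.diag(np.diag(a)))


# Hypothetical example: 8 words, k = 3 topics (illustrative values only).
rng = np.random.default_rng(0)
vocab, k = 8, 3
topics = rng.dirichlet(np.ones(vocab), size=k).T  # each column sums to one
alpha = np.array([0.5, 1.0, 1.5])
alpha0 = alpha.sum()

m1, pairs = population_moments(topics, alpha)
pinv = np.linalg.pinv(topics)  # maps word space back to topic coordinates


def in_topic_basis(alpha0_trial):
    """Express the corrected moment in the latent topic basis."""
    m2 = corrected_second_moment(pairs, m1, alpha0_trial)
    return pinv @ m2 @ pinv.T


at_truth = in_topic_basis(alpha0)         # diagonal at the true alpha0
off_truth = in_topic_basis(alpha0 + 1.0)  # off-diagonal terms remain

print("off-diagonal norm at true alpha0:  ", offdiag_norm(at_truth))
print("off-diagonal norm at wrong alpha0: ", offdiag_norm(off_truth))
```

At the true total concentration the corrected moment is diagonal in the topic basis, with entries $α_i/(α_0(α_0+1))$; at a misspecified value, cross-topic terms survive. In practice the topic basis is unknown, which is why a criterion that depends only on observable word moments is needed to pin down $α_0$.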
