Topic models are often used as dimension-reduction tools before regression, with estimated document-level topic shares treated as observed covariates. This plug-in workflow creates two inferential difficulties: valid inference requires a regular first-stage-to-second-stage expansion that propagates topic-estimation uncertainty, and, at fixed document length, a document's topic mixture cannot be consistently recovered from its own words even when the population topic matrix is known. Corrected spectral moment methods for latent Dirichlet allocation (LDA) offer a starting point: when the total Dirichlet concentration is known, low-order word moments can be corrected to yield operators diagonal in the latent topic basis. We extend this to downstream regression. Under a finite LDA model with response residuals orthogonal to the low-order token moments used for identification, response-weighted word moments admit the same correction, and the resulting supervised operator identifies the regression coefficient $β$ directly, without estimating document-level topic shares. The main obstacle is that the correction depends on the unknown total concentration $α_0$. We show that, for $k\ge3$ topics and under a generic finite-probe condition, $α_0$ is identified by commutativity: at the true value a family of corrected word-moment operators commute, whereas away from it they generically do not. This yields a feasible estimator and lets uncertainty in $\hatα_0$ propagate into inference for $β$. The estimator is asymptotically linear as the number of documents grows with fixed document length, with sandwich standard errors from document-level moment contributions. Simulations show near-nominal coverage where plug-in topic-share regressions can undercover, and an application to top economics journals illustrates contrast inference for latent topic effects.
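For concreteness, the correction and the commutativity criterion can be sketched with the standard corrected moments from the spectral LDA literature. The notation here is an assumption made for illustration and may differ from the paper's own construction: $x_1, x_2, x_3$ are one-hot encodings of three tokens from one document, $\mu_1, \dots, \mu_k$ are the topic-word probability vectors with $O = [\mu_1, \dots, \mu_k]$ of full column rank, $\alpha_i > 0$ are the Dirichlet parameters with $\alpha_0 = \sum_i \alpha_i$, $a$ is a candidate value for the concentration, and $\eta, \eta'$ are probe vectors.

$$
\begin{aligned}
M_1 &= \mathbb{E}[x_1], \\
M_2(a) &= \mathbb{E}[x_1 \otimes x_2] - \frac{a}{a+1}\, M_1 \otimes M_1, \\
M_3(a) &= \mathbb{E}[x_1 \otimes x_2 \otimes x_3]
 - \frac{a}{a+2}\Big( \mathbb{E}[x_1 \otimes x_2 \otimes M_1] + \mathbb{E}[x_1 \otimes M_1 \otimes x_2] + \mathbb{E}[M_1 \otimes x_1 \otimes x_2] \Big) \\
 &\quad + \frac{2a^2}{(a+2)(a+1)}\, M_1 \otimes M_1 \otimes M_1 .
\end{aligned}
$$

At the true value $a = \alpha_0$ the cross terms cancel and both moments are diagonal in the topic basis:

$$
M_2(\alpha_0) = \sum_{i=1}^{k} \frac{\alpha_i}{\alpha_0(\alpha_0+1)}\, \mu_i \otimes \mu_i,
\qquad
M_3(\alpha_0) = \sum_{i=1}^{k} \frac{2\alpha_i}{\alpha_0(\alpha_0+1)(\alpha_0+2)}\, \mu_i \otimes \mu_i \otimes \mu_i .
$$

One possible family of operators, again an illustrative choice rather than necessarily the one used in the paper, contracts the third mode of $M_3(a)$ with a probe and multiplies by the pseudo-inverse of $M_2(a)$:

$$
A_\eta(a) = M_3(a)(I, I, \eta)\, M_2(a)^{+},
\qquad
A_\eta(\alpha_0) = O\, \operatorname{diag}\!\left( \frac{2\langle \mu_i, \eta\rangle}{\alpha_0+2} \right) O^{+} .
$$

Because $O^{+} O = I_k$, the operators at the true value share the eigenvectors $\mu_i$, so $A_\eta(\alpha_0) A_{\eta'}(\alpha_0) = A_{\eta'}(\alpha_0) A_\eta(\alpha_0)$ for any pair of probes. For $a \neq \alpha_0$, terms involving $M_1$ remain in both $M_2(a)$ and $M_3(a)$, which is the sense in which the commutator would generically be nonzero.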