The leading strategy for analyzing unstructured data uses two steps. First, latent variables of economic interest are estimated with an upstream information retrieval model. Second, the estimates are treated as "data" in a downstream econometric model. We establish theoretical arguments for why this two-step strategy leads to biased inference in empirically plausible settings. More constructively, we propose a one-step strategy for valid inference that uses the upstream and downstream models jointly. The one-step strategy (i) substantially reduces bias in simulations; (ii) has quantitatively important effects in a leading application using CEO time-use data; and (iii) can be readily adapted by applied researchers.
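To make the two-step concern concrete, the following is a minimal simulation sketch and not the authors' model or estimator. It assumes the upstream step recovers the latent variable with classical (independent, mean-zero) measurement error, which is only one way such bias might arise. All names (`simulate`, `ols_slope`, `noise_sd`) and parameter values are hypothetical choices made for illustration.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x, with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


def simulate(n=20000, beta=1.0, noise_sd=1.0, seed=0):
    """Compare an infeasible 'oracle' regression with a two-step regression.

    Assumption (illustrative only): the upstream estimate equals the true
    latent variable plus independent Gaussian noise.
    """
    rng = random.Random(seed)
    # True latent variable of economic interest (unobserved in practice).
    theta = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Step 1: upstream model output, modeled here as a noisy estimate.
    theta_hat = [t + rng.gauss(0.0, noise_sd) for t in theta]
    # Outcome generated from the true latent variable.
    y = [beta * t + rng.gauss(0.0, 1.0) for t in theta]
    # Step 2: the downstream regression treats theta_hat as "data".
    return ols_slope(theta, y), ols_slope(theta_hat, y)


oracle_slope, two_step_slope = simulate()
print(f"oracle slope:   {oracle_slope:.3f}")
print(f"two-step slope: {two_step_slope:.3f}")
```

Under these assumed settings, the textbook attenuation factor is var(theta) / (var(theta) + noise_sd^2) = 0.5, so the two-step slope should land near half of the true coefficient while the oracle slope stays near 1. A one-step approach of the kind the abstract proposes would instead model the upstream and downstream stages jointly; that is not implemented here.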