Empirical models of demand for differentiated products rely on low-dimensional product representations to capture substitution patterns. These representations are increasingly proxied by applying ML methods to high-dimensional, unstructured data, including product descriptions and images. When proxies fail to capture the true dimensions of differentiation that drive substitution, standard workflows will deliver biased counterfactuals and invalid inference. We develop a practical toolkit that corrects this bias and ensures valid inference for a broad class of counterfactuals. Our approach applies to market-level and/or individual data, requires minimal additional computation, is efficient, delivers simple formulas for standard errors, and accommodates data-dependent proxies, including embeddings from fine-tuned ML models. It can also be used with standard quantitative attributes when mismeasurement is a concern. In addition, we propose diagnostics to assess the adequacy of the proxy construction and dimension. The approach yields meaningful improvements in predicting counterfactual substitution in both simulations and an empirical application.