Many forecasting applications have a limited dependent target variable: one that is zero for most observations and positive for the rest. The econometrics literature contains extensive research on building statistical models for such variables. In particular, there are two-component model approaches, in which one model is built for the probability that the target is positive and a second model for the value of the target given that it is positive. However, the econometrics literature focuses on effect estimation and does not provide theory for predictive modeling. Nevertheless, some of its concepts, such as the two-component model approach and Heckman's sample selection correction, also appear in the predictive modeling literature without a sound theoretical foundation. In this paper, we theoretically analyze predictive modeling for limited dependent variables and derive best practices. By analyzing various real-world data sets, we also use the derived theoretical results to explain which predictive modeling approach works best for which application.
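As an illustration of the two-component idea described in the abstract, the prediction for an observation can be formed as the product of the two model outputs, E[y | x] = P(y > 0 | x) · E[y | y > 0, x]. The following is a minimal sketch under strong simplifying assumptions, not the method analyzed in the paper: the only feature is a single categorical group label, both components are plain empirical averages, and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def fit_two_part(groups, targets):
    """Estimate, per group, P(y > 0) and E[y | y > 0] by simple averaging."""
    n = defaultdict(int)          # observations per group
    n_pos = defaultdict(int)      # positive observations per group
    sum_pos = defaultdict(float)  # sum of positive targets per group
    for g, y in zip(groups, targets):
        n[g] += 1
        if y > 0:
            n_pos[g] += 1
            sum_pos[g] += y
    model = {}
    for g in n:
        p_pos = n_pos[g] / n[g]
        # If a group has no positive observations, the conditional mean
        # is undefined; 0.0 is used here as an arbitrary placeholder.
        mean_pos = sum_pos[g] / n_pos[g] if n_pos[g] else 0.0
        model[g] = (p_pos, mean_pos)
    return model


def predict_two_part(model, group):
    """Combine both components: P(y > 0) * E[y | y > 0]."""
    p_pos, mean_pos = model[group]
    return p_pos * mean_pos


# Toy data (made up for illustration): mostly zeros, a few positive values.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
targets = [0.0, 0.0, 0.0, 8.0, 0.0, 2.0, 4.0, 0.0]

model = fit_two_part(groups, targets)
for g in sorted(model):
    print(g, model[g], predict_two_part(model, g))
```

In this toy setting the two-component prediction coincides with the plain group mean of the target, because the zero observations contribute nothing to the sum. With richer features and separately fitted models for each component, the two approaches generally differ.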