Machine learning (ML) algorithms are powerful data-driven tools for approximating high-dimensional or non-linear nuisance functions, which is useful in practice because the true functional form of the predictors is unknown ex ante. In this paper, we develop estimators of the effects of policy interventions from panel data that allow for non-linear effects of the confounding regressors, and we investigate the performance of these estimators using three well-known ML algorithms: LASSO, classification and regression trees, and random forests. We use Double Machine Learning (DML) (Chernozhukov et al., 2018) to estimate causal effects of homogeneous treatments with unobserved individual heterogeneity (fixed effects) and no unobserved confounding, extending the partially linear regression model of Robinson (1988). We develop three alternative approaches for handling unobserved individual heterogeneity, based on extending the within-group estimator, the first-difference estimator, and the correlated random effects estimator (Mundlak, 1978) to non-linear models. Using Monte Carlo simulations, we find that conventional least squares estimators can perform well even if the data generating process is non-linear, but that there are substantial gains from the ML-based approach in terms of bias reduction when the true effect of the regressors is non-linear and discontinuous. However, for the same scenarios, and despite extensive hyperparameter tuning, we also find inference to be problematic for both tree-based learners, because they lead to highly non-normal estimator distributions and severely under-estimated estimator variance. This contrasts with the performance of trees in other circumstances and requires further investigation. Finally, we provide an illustrative application of DML to observational panel data, showing the impact of the introduction of the national minimum wage in the UK.
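
As a rough sketch of the setup described above, the partially linear panel model with a homogeneous treatment effect and its first-difference transformation might be written as follows. The notation is assumed here for illustration and is not taken from the text: outcome $y_{it}$, treatment $d_{it}$, confounders $x_{it}$, individual fixed effect $\alpha_i$, idiosyncratic error $u_{it}$, unknown nuisance function $l_0$, and target parameter $\theta_0$.

```latex
\begin{align}
  % Partially linear model with an additive fixed effect (assumed notation)
  y_{it} &= \theta_0 d_{it} + l_0(x_{it}) + \alpha_i + u_{it}, \\
  % First-differencing removes \alpha_i but not the non-linearity in l_0
  y_{it} - y_{i,t-1} &= \theta_0 \,(d_{it} - d_{i,t-1})
      + l_0(x_{it}) - l_0(x_{i,t-1}) + u_{it} - u_{i,t-1}.
\end{align}
```

Differencing eliminates $\alpha_i$, but $l_0(x_{it}) - l_0(x_{i,t-1})$ is in general not a function of $x_{it} - x_{i,t-1}$ alone when $l_0$ is non-linear. Under this sketch, a learner for the differenced nuisance term would therefore need both $x_{it}$ and $x_{i,t-1}$ as inputs, which is one reason the linear panel estimators do not carry over unchanged.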