Machine learning (ML) algorithms are powerful data-driven tools for approximating high-dimensional or non-linear nuisance functions. This is useful in practice because the true functional form of the predictors is unknown ex ante. In this paper, we develop estimators of the effects of policy interventions from panel data that allow for non-linear effects of the confounding regressors, and we investigate their performance using three well-known ML algorithms: LASSO, classification and regression trees, and random forests. We use Double Machine Learning (DML) (Chernozhukov et al., 2018) to estimate causal effects of homogeneous treatments under unobserved individual heterogeneity (fixed effects) and no unobserved confounding, extending the partially linear regression model of Robinson (1988). We develop three alternative approaches for handling unobserved individual heterogeneity, which extend the within-group estimator, the first-difference estimator, and the correlated random effects estimator (Mundlak, 1978) to non-linear models. Using Monte Carlo simulations, we find that conventional least squares estimators can perform well even if the data generating process is non-linear, but that DML yields substantial bias reductions when the true effect of the regressors is non-linear and discontinuous. However, for the same scenarios, and despite extensive hyperparameter tuning, we also find inference to be problematic for both tree-based learners, because these lead to highly non-normal estimator distributions and severely under-estimated estimator variance. This contrasts with the performance of trees in other circumstances and requires further investigation. Finally, we provide an illustrative application of DML to observational panel data, examining the impact of the introduction of the national minimum wage in the UK.
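As a rough illustration of the first-difference approach described in the abstract, the following Python sketch applies cross-fitted partialling-out (the residual-on-residual form of DML) to a simulated panel. Everything in it is an assumption made for illustration and not the paper's implementation: the data generating process in `simulate_panel`, the function names, the number of folds, and in particular the learner, which is a ridge regression on a cubic polynomial basis standing in for the LASSO, tree, and forest learners studied in the paper. First-differencing removes the fixed effect, so the nuisance functions are learned as functions of both the current and the lagged confounder.

```python
import numpy as np


def simulate_panel(n_units=1000, n_periods=4, theta=0.5, seed=0):
    """Hypothetical DGP: one confounder with a non-linear, discontinuous effect."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=(n_units, 1))                     # fixed effect
    x = 0.5 * alpha + rng.normal(size=(n_units, n_periods))   # correlated with alpha
    d = np.sin(x) + alpha + rng.normal(size=(n_units, n_periods))
    y = (theta * d + np.cos(x) + (x > 0) + alpha
         + rng.normal(size=(n_units, n_periods)))
    return y, d, x


def basis(x_now, x_lag):
    """Cubic polynomial features of (x_t, x_{t-1}); a stand-in for an ML learner."""
    cols = [np.ones_like(x_now)]
    for power in (1, 2, 3):
        cols += [x_now ** power, x_lag ** power]
    cols.append(x_now * x_lag)
    return np.column_stack(cols)


def fit_predict(f_train, target_train, f_test, ridge=1e-3):
    """Ridge regression: fit on the training folds, predict on the held-out fold."""
    k = f_train.shape[1]
    coef = np.linalg.solve(f_train.T @ f_train + ridge * np.eye(k),
                           f_train.T @ target_train)
    return f_test @ coef


def fd_dml(y, d, x, n_folds=5, seed=0):
    """First-difference partialling-out estimate of theta with cross-fitting."""
    n_units, n_periods = y.shape
    dy = np.diff(y, axis=1).ravel()                 # first differences remove alpha
    dd = np.diff(d, axis=1).ravel()
    feats = basis(x[:, 1:].ravel(), x[:, :-1].ravel())
    unit = np.repeat(np.arange(n_units), n_periods - 1)

    # Assign whole individuals to folds so a unit is never split across folds.
    rng = np.random.default_rng(seed)
    fold = (rng.permutation(n_units) % n_folds)[unit]

    res_y = np.empty_like(dy)
    res_d = np.empty_like(dd)
    for k in range(n_folds):
        test = fold == k
        res_y[test] = dy[test] - fit_predict(feats[~test], dy[~test], feats[test])
        res_d[test] = dd[test] - fit_predict(feats[~test], dd[~test], feats[test])

    # Residual-on-residual regression gives the treatment effect estimate.
    theta_hat = (res_d @ res_y) / (res_d @ res_d)

    # Standard error clustered by individual (sandwich form).
    score = res_d * (res_y - theta_hat * res_d)
    cluster_sums = np.bincount(unit, weights=score, minlength=n_units)
    se = np.sqrt(cluster_sums @ cluster_sums) / (res_d @ res_d)
    return theta_hat, se


if __name__ == "__main__":
    theta_hat, se = fd_dml(*simulate_panel())
    print(f"theta_hat = {theta_hat:.3f}, clustered se = {se:.3f}")
```

Two design choices in this sketch are worth noting. Folds are formed over individuals, not over individual-period rows, so that all observations on a unit are held out together, and the standard error is clustered by individual because first-differenced errors are serially correlated within a unit. Swapping the polynomial ridge for a tree-based learner would only require replacing `fit_predict`, although the abstract's findings suggest that inference may then be less reliable.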