We characterize the structure and origins of missingness for 159 cross-sectional return predictors and study missing value handling for portfolios constructed using machine learning. Simply imputing with cross-sectional means performs well compared to rigorous expectation-maximization methods. This stems from three facts about predictor data: (1) missingness occurs in large blocks organized by time, (2) cross-sectional correlations are small, and (3) missingness tends to occur in blocks organized by the underlying data source. As a result, observed data provide little information about missing data. Sophisticated imputations introduce estimation noise that can lead to underperformance if machine learning is not carefully applied.
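As an illustration of the simple baseline described above, the following is a minimal sketch of cross-sectional mean imputation, assuming a long-format panel with one row per stock-month. The function name `impute_cross_sectional_mean`, the column names (`date`, `stock`, `bm`, `mom`), and the toy values are hypothetical and are not taken from the paper. Each missing predictor value is replaced by the mean of that predictor across the stocks observed on the same date.

```python
import numpy as np
import pandas as pd


def impute_cross_sectional_mean(panel, date_col, predictor_cols):
    """Fill each missing predictor value with that predictor's mean
    across all stocks observed on the same date.

    A predictor that is missing for every stock on a given date has no
    cross-sectional mean and is left as NaN (an assumption of this sketch).
    """
    out = panel.copy()
    # Per-date mean of each predictor, broadcast back to every row.
    # NaNs are skipped when the mean is computed.
    means = out.groupby(date_col)[predictor_cols].transform("mean")
    out[predictor_cols] = out[predictor_cols].fillna(means)
    return out


# Toy panel: two predictors, three stocks, two months (illustrative values).
panel = pd.DataFrame(
    {
        "date": ["2000-01", "2000-01", "2000-01", "2000-02", "2000-02"],
        "stock": ["A", "B", "C", "A", "B"],
        "bm": [0.5, np.nan, 1.5, np.nan, np.nan],
        "mom": [0.10, 0.30, np.nan, -0.20, np.nan],
    }
)

filled = impute_cross_sectional_mean(panel, "date", ["bm", "mom"])
print(filled)
```

In the toy panel, `bm` is missing for every stock in the second month, so it stays missing after imputation. This mirrors the block structure the abstract describes: when a predictor is unavailable for a whole stretch of time, the other observations in the same cross-section carry no information about it, and a separate convention is needed for such blocks.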