Combining many cross-sectional return predictors (for example, in machine learning) often requires imputing missing values. We compare ad-hoc mean imputation with several alternatives, including maximum likelihood. Surprisingly, maximum likelihood and ad-hoc methods lead to similar results. This is because predictors are largely independent: correlations cluster near zero, and 10 principal components (PCs) span less than 50% of total variance. Independence implies that observed predictors are uninformative about missing predictors, which makes ad-hoc methods valid. In PC regression tests, 50 PCs are required to capture equal-weighted expected returns (30 PCs for value-weighted returns), regardless of the imputation method. We find similar invariance in neural network portfolios.
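The independence argument can be illustrated with a minimal sketch. It is not the paper's estimation procedure: it assumes jointly Gaussian predictors with a known mean and covariance, and it uses a hypothetical equicorrelation matrix controlled by `rho`. Under those assumptions, the conditional mean of the missing predictors given the observed ones is mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o). When predictors are independent, Sigma_mo is zero and the formula collapses to plain mean imputation. The function names and parameters below are illustrative only.

```python
import numpy as np


def conditional_mean_impute(x, mu, sigma):
    """Fill NaNs in one observation with the Gaussian conditional mean.

    Assumes mu and sigma are known; a full maximum likelihood approach
    would estimate them (for example with EM), which is not shown here.
    """
    x = x.copy()
    miss = np.isnan(x)
    obs = ~miss
    if not miss.any() or not obs.any():
        # Nothing to condition on (or nothing missing): fall back to the mean.
        x[miss] = mu[miss]
        return x
    # E[x_m | x_o] = mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o)
    sigma_mo = sigma[np.ix_(miss, obs)]
    sigma_oo = sigma[np.ix_(obs, obs)]
    x[miss] = mu[miss] + sigma_mo @ np.linalg.solve(sigma_oo, x[obs] - mu[obs])
    return x


def max_imputation_gap(rho, n_pred=5, seed=0):
    """Largest absolute difference between conditional-mean and mean imputation."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(n_pred)
    # Hypothetical equicorrelation structure: every pair has correlation rho.
    sigma = np.full((n_pred, n_pred), rho)
    np.fill_diagonal(sigma, 1.0)
    x = rng.multivariate_normal(mu, sigma)
    x[:2] = np.nan  # pretend the first two predictors are missing
    conditional_fill = conditional_mean_impute(x, mu, sigma)
    mean_fill = np.where(np.isnan(x), mu, x)
    return float(np.max(np.abs(conditional_fill - mean_fill)))


gap_independent = max_imputation_gap(rho=0.0)
gap_correlated = max_imputation_gap(rho=0.8)
print(gap_independent, gap_correlated)
```

With `rho=0.0` the two imputations coincide, while a large `rho` makes them diverge. This mirrors the abstract's reasoning that near-zero correlations leave little for a likelihood-based method to exploit.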