Small regional datasets pose a dual statistical problem: correlated predictors inflate estimation variance, while flexible learners can become unstable because the available information per adaptive degree of freedom is limited. We examine this issue through predictive volatility, defined as the cross-sample dispersion and upper-tail behaviour of out-of-sample loss. Using simulation evidence reported for sparse linear, near-linear and heavy-tailed settings, we compare ordinary least squares, frequentist penalties, Bayesian shrinkage models, bounded-response and spatial specifications, and flexible machine-learning procedures. In the reported simulation results, regularised linear estimators generally dominate in the linear high-collinearity micro-sample settings and remain the most reliable overall, whereas tree-based methods become more competitive only when the signal is weakly nonlinear and the sample size is larger. In the empirical application to 34 Indonesian provinces, ridge yields the best leave-one-out performance, followed by elastic net and lasso. Across the Bayesian shrinkage specifications, ICT skills show the most consistent negative association with poverty, with the strongest support under horseshoe and spike-and-slab formulations. These results suggest that, in micro-sample regional modelling, the main constraint is limited information per effective degree of freedom rather than insufficient algorithmic flexibility.
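As a minimal illustration of the predictive-volatility summary defined above, the following sketch computes the cross-sample dispersion and an upper-tail quantile of out-of-sample squared-error loss for ordinary least squares and ridge under correlated predictors. It is not the simulation design used in the study: the function names, the equicorrelated design, the coefficient vector, the penalty value, the 0.95 quantile and the replicate count are all assumptions chosen for illustration. Only the training sample size of 34 echoes the empirical application.

```python
import numpy as np


def simulate_losses(n_train=34, n_test=200, p=8, rho=0.9, lam=5.0,
                    n_rep=300, seed=0):
    """Out-of-sample MSE of OLS and ridge across repeated samples.

    All settings here are illustrative assumptions, not the paper's design.
    """
    rng = np.random.default_rng(seed)
    # Equicorrelated predictors: unit variances, pairwise correlation rho.
    cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    chol = np.linalg.cholesky(cov)
    # Assumed sparse linear signal: two active predictors.
    beta = np.zeros(p)
    beta[:2] = 1.0
    losses = {"ols": [], "ridge": []}
    for _ in range(n_rep):
        X = rng.standard_normal((n_train, p)) @ chol.T
        y = X @ beta + rng.standard_normal(n_train)
        X_new = rng.standard_normal((n_test, p)) @ chol.T
        y_new = X_new @ beta + rng.standard_normal(n_test)
        # Penalty 0 gives OLS; a positive penalty gives ridge (no intercept).
        for name, penalty in (("ols", 0.0), ("ridge", lam)):
            coef = np.linalg.solve(X.T @ X + penalty * np.eye(p), X.T @ y)
            losses[name].append(np.mean((y_new - X_new @ coef) ** 2))
    return {name: np.asarray(vals) for name, vals in losses.items()}


def predictive_volatility(loss, q=0.95):
    """Mean, cross-sample standard deviation and upper-tail quantile of loss."""
    return {
        "mean": float(loss.mean()),
        "sd": float(loss.std(ddof=1)),
        "upper_tail": float(np.quantile(loss, q)),
    }


if __name__ == "__main__":
    for name, loss in simulate_losses().items():
        print(name, predictive_volatility(loss))
```

The output depends entirely on the assumed settings and should not be read as a reproduction of the reported results. The sketch only makes concrete that predictive volatility summarises how out-of-sample loss varies across samples, rather than its average alone.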