Estimating the error of a machine learning model is often complicated by distribution bias, particularly for spatial data such as those found in environmental studies. We introduce an approach based on importance sampling to obtain an unbiased estimate of the target error. By accounting for the difference between the distribution over which the error is desired and the distribution of the available data, our method reweights the error at each sample point and neutralizes the shift. The weights are computed using importance sampling and kernel density estimation. We validate the approach on artificial data that resemble real-world spatial datasets. Our findings demonstrate the advantages of the proposed approach for estimating the target error, offering a solution to the distribution shift problem. The overall error of predictions dropped from 7% to 2%, and it decreases further for larger samples.
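The reweighting idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the synthetic Gaussian sampling locations, and the quadratic error surface are all assumptions made for illustration. Each per-sample error is weighted by the ratio of the target density to the sampling density, with both densities estimated by `scipy.stats.gaussian_kde`. The sketch uses the self-normalized form of the estimator, which is consistent rather than strictly unbiased.

```python
import numpy as np
from scipy.stats import gaussian_kde


def importance_weights(x_source, x_target, eps=1e-12):
    """Density-ratio weights p_target(x) / p_source(x) at the source points.

    x_source, x_target: arrays of shape (n_points, n_dims).
    Both densities are estimated with Gaussian KDE (an illustrative choice).
    """
    # gaussian_kde expects data with shape (n_dims, n_points)
    kde_source = gaussian_kde(x_source.T)
    kde_target = gaussian_kde(x_target.T)
    p_source = kde_source(x_source.T)
    p_target = kde_target(x_source.T)
    # Guard against division by a vanishing density estimate
    return p_target / np.maximum(p_source, eps)


def reweighted_error(errors, x_source, x_target):
    """Self-normalized importance-sampling estimate of the mean target error."""
    w = importance_weights(x_source, x_target)
    return np.sum(w * errors) / np.sum(w)


def demo(seed=0, n=2000, shift=0.7):
    """Synthetic 2-D example (hypothetical data, not from the paper)."""
    rng = np.random.default_rng(seed)
    # Locations where errors were measured: N(0, I)
    x_source = rng.normal(0.0, 1.0, size=(n, 2))
    # Locations where the error estimate is wanted: N(shift, I)
    x_target = rng.normal(shift, 1.0, size=(n, 2))
    # Hypothetical per-sample error that grows away from the origin
    errors = np.sum(x_source**2, axis=1)
    naive = errors.mean()
    weighted = reweighted_error(errors, x_source, x_target)
    # Exact expectation of x0^2 + x1^2 under N(shift, I)
    truth = 2.0 * (1.0 + shift**2)
    return naive, weighted, truth


if __name__ == "__main__":
    naive, weighted, truth = demo()
    print(f"naive mean error:      {naive:.3f}")
    print(f"reweighted mean error: {weighted:.3f}")
    print(f"true target error:     {truth:.3f}")
```

In this setup the naive average reflects only the sampled region, while the reweighted average moves toward the target value. Kernel smoothing and heavy-tailed weights can still leave a residual gap, so the bandwidth and the overlap between the two distributions matter in practice.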