In machine learning, regression problems are pivotal because they predict continuous outcomes. Traditional metrics such as mean squared error, mean absolute error, and the coefficient of determination measure model accuracy. Model accuracy, however, results jointly from the selected model and the selected features, which makes it hard to separate the contribution of each. Predictability, on the other hand, focuses on the degree to which a target variable can be predicted given a set of features. This study bridges this gap by introducing conditional entropy estimators to assess predictability in regression problems. We enhance and develop reliable conditional entropy estimators, in particular the KNIFE-P and LMC-P estimators, which provide under- and over-estimates, respectively, yielding a practical framework for predictability analysis. Extensive experiments on synthetic and real-world datasets demonstrate the robustness and utility of these estimators. Additionally, we extend the analysis to the coefficient of determination \(R^2\), enhancing the interpretability of predictability. The results highlight the effectiveness of KNIFE-P and LMC-P in capturing the achievable performance and limitations of feature sets, providing valuable tools for developing regression models. These indicators offer a robust framework for assessing the predictability of regression problems.
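To make the link between conditional entropy and achievable \(R^2\) concrete, the sketch below uses a standard information-theoretic bound rather than the KNIFE-P or LMC-P estimators themselves: for any predictor \(f\), \(\mathbb{E}[(Y - f(X))^2] \ge e^{2h(Y \mid X)} / (2\pi e)\) with \(h\) in nats, which implies \(R^2 \le 1 - e^{2h(Y \mid X)} / (2\pi e \,\mathrm{Var}(Y))\). The linear-Gaussian toy data, the function names, and the parameter values are assumptions chosen for illustration, and the conditional entropy is known analytically here instead of being estimated from data.

```python
import numpy as np


def gaussian_entropy(var):
    """Differential entropy (nats) of a 1-D Gaussian with variance `var`."""
    return 0.5 * np.log(2.0 * np.pi * np.e * var)


def r2_upper_bound(cond_entropy_nats, var_y):
    """Upper bound on R^2 implied by h(Y|X) in nats and Var(Y).

    Uses the estimation-error bound MSE >= exp(2 h(Y|X)) / (2 pi e).
    """
    min_mse = np.exp(2.0 * cond_entropy_nats) / (2.0 * np.pi * np.e)
    return 1.0 - min_mse / var_y


def simulate(n=20000, slope=2.0, noise_std=1.0, seed=0):
    """Toy data (assumed model): y = slope * x + Gaussian noise, x ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = slope * x + rng.normal(scale=noise_std, size=n)
    return x, y


def ols_r2(x, y):
    """In-sample R^2 of an ordinary least-squares fit of y on x with intercept."""
    design = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()


slope, noise_std = 2.0, 1.0
x, y = simulate(slope=slope, noise_std=noise_std)

# In this toy model Y | X is Gaussian with variance noise_std**2,
# so h(Y|X) is available in closed form.
h_y_given_x = gaussian_entropy(noise_std**2)
var_y = slope**2 + noise_std**2  # population variance of Y when Var(X) = 1

bound = r2_upper_bound(h_y_given_x, var_y)
fitted = ols_r2(x, y)

print(f"R^2 bound from h(Y|X): {bound:.3f}")
print(f"R^2 of a fitted linear model: {fitted:.3f}")
```

In this Gaussian case the bound is tight, so a well-specified linear model should land close to it; the in-sample \(R^2\) can differ slightly from the population bound because of sampling noise. With real data, \(h(Y \mid X)\) is unknown and has to be estimated, which is the role the abstract assigns to the under- and over-estimating estimators.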