The way data are organized is widely recognized to strongly affect the performance of machine learning algorithms, particularly in binary classification. We present a theoretical framework suggesting that the best performance a binary classifier can achieve on a given dataset is constrained primarily by intrinsic properties of the data. Combining theoretical reasoning with empirical examination, and using standard objective functions, evaluation metrics, and binary classifiers, we reach two principal conclusions. First, we show that the theoretical upper bound of binary classification performance on real datasets is attainable. This upper bound represents a calculable equilibrium between the learning loss and the evaluation metric. Second, we compute the exact upper bounds for three commonly used evaluation metrics and find them consistent with our overall thesis: the upper bound is determined by the characteristics of the dataset and is independent of the classifier used. Further analysis reveals a detailed relationship between the performance upper bound and the degree of class overlap in the binary classification data. This relationship is useful for identifying the most effective feature subsets in feature engineering.