The way data is organized is widely recognized to strongly affect the performance of machine learning algorithms, particularly in binary classification. We present a theoretical framework showing that the best performance a binary classifier can achieve on a given dataset is constrained primarily by intrinsic properties of the data. Using standard objective functions, evaluation metrics, and binary classifiers, we reach two main conclusions through theoretical analysis and empirical study. First, we show that the upper bound on binary classification performance on real datasets is attainable in theory. This upper bound reflects a computable balance between the learning loss and the evaluation metric. Second, we compute exact upper bounds for three commonly used evaluation metrics and find them consistent with our central claim: the upper bound is determined by the characteristics of the dataset and is independent of the classifier used. Further analysis reveals a detailed relationship between the performance upper bound and the degree of class overlap in the binary classification data. This relationship is useful for identifying the most effective feature subsets in feature engineering.
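The link between class overlap and a classifier-independent ceiling can be illustrated with a minimal sketch. It is not the derivation used in this work; it assumes accuracy as the metric, treats the sample as the full distribution, and defines "overlap" narrowly as identical feature vectors carrying conflicting labels. Under those assumptions, no deterministic classifier can do better than predicting the majority label for each distinct feature vector. The function names `accuracy_upper_bound`, `overlap_fraction`, and `project` are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict


def _label_counts(X, y):
    """Group labels by distinct feature vector."""
    counts = defaultdict(Counter)
    for features, label in zip(X, y):
        counts[tuple(features)][label] += 1
    return counts


def accuracy_upper_bound(X, y):
    """Best accuracy any deterministic classifier can reach on (X, y).

    For each distinct feature vector, at most the majority-label
    samples can be classified correctly.
    """
    counts = _label_counts(X, y)
    correct = sum(max(c.values()) for c in counts.values())
    return correct / len(y)


def overlap_fraction(X, y):
    """Share of samples whose feature vector appears with both labels."""
    counts = _label_counts(X, y)
    overlapping = sum(sum(c.values()) for c in counts.values() if len(c) > 1)
    return overlapping / len(y)


def project(X, cols):
    """Keep only the selected feature columns (a candidate feature subset)."""
    return [tuple(row[c] for c in cols) for row in X]


# Toy data (illustrative only): the label equals the second feature.
X_demo = [(0, 0), (0, 1), (1, 0), (1, 1)]
y_demo = [0, 1, 0, 1]

bound_f0 = accuracy_upper_bound(project(X_demo, [0]), y_demo)  # 0.5
bound_f1 = accuracy_upper_bound(project(X_demo, [1]), y_demo)  # 1.0
```

In this toy case, keeping only the first feature makes every sample overlap with one of the opposite class and caps accuracy at 0.5, while keeping only the second feature removes the overlap and raises the cap to 1.0. Comparing such bounds across candidate subsets is one simple way to rank them before training any model.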