A dataset with two labels is linearly separable if it can be split into its two classes with a hyperplane. This inflicts a curse on some statistical tools (such as logistic regression) but forms a blessing for others (e.g. support vector machines). Recently, the following question has regained interest: What is the probability that the data are linearly separable? We provide a formula for the probability of linear separability for Gaussian features and labels depending only on one marginal of the features (as in generalized linear models). In this setting, we derive an upper bound that complements the recent result by Hayakawa, Lyons, and Oberhauser [2023], and a sharp upper bound for sign-flip noise. To prove our results, we exploit that this probability can be expressed as a sum of the intrinsic volumes of a polyhedral cone of the form $\text{span}\{v\}\oplus[0,\infty)^n$, as shown in Cand\`es and Sur [2020]. After providing the inequality description for this cone, and an algorithm to project onto it, we calculate its intrinsic volumes. In doing so, we encounter Youden's demon problem, for which we provide a formula following Kabluchko and Zaporozhets [2020]. The key insight of this work is the following: The number of correctly labeled observations in the data affects the structure of this polyhedral cone, allowing the translation of insights from geometry into statistics.
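To make the central question concrete, the following is a minimal sketch, not the method of the paper, that estimates the probability of linear separability by simulation. It assumes standard Gaussian features, labels given by the sign of the first feature (one marginal), and independent sign-flip noise; separability is checked as a linear-programming feasibility problem with `scipy.optimize.linprog`. The function names `is_linearly_separable` and `estimate_separability`, the choice of an affine hyperplane (with intercept), and all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog


def is_linearly_separable(X, y):
    """Check whether some (w, b) satisfies y_i * (x_i @ w + b) >= 1 for all i.

    Labels y are assumed to take values in {-1, +1}. Strict separation by a
    hyperplane is equivalent to feasibility of this system after rescaling.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Variables are (w_1, ..., w_p, b); rewrite as -y_i * (x_i @ w + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(
        np.zeros(p + 1),                  # zero objective: feasibility only
        A_ub=A_ub,
        b_ub=b_ub,
        bounds=[(None, None)] * (p + 1),  # all variables are unbounded
        method="highs",
    )
    return res.status == 0  # status 0: a feasible point was found


def estimate_separability(n, p, flip_prob, trials=200, seed=0):
    """Monte Carlo estimate of P(separable) under the assumed model."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        y = np.sign(X[:, 0])              # label depends on one marginal only
        flips = rng.random(n) < flip_prob
        y[flips] *= -1                    # sign-flip noise (assumed i.i.d.)
        hits += is_linearly_separable(X, y)
    return hits / trials
```

For example, `estimate_separability(n=20, p=5, flip_prob=0.1)` returns the fraction of simulated datasets that were separable. Such an estimate is only a numerical sanity check on a formula or bound; it does not involve the intrinsic volumes of the cone.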