In this paper we revisit the classical method of partitioning classification and study its convergence rate under relaxed conditions, both for observable (non-privatised) and for privatised data. Let the feature vector $X$ take values in $\mathbb{R}^d$ and denote its label by $Y$. Previous results on the partitioning classifier relied on the strong density assumption, which is restrictive, as we demonstrate through simple examples. We assume that the distribution of $X$ is a mixture of an absolutely continuous and a discrete distribution, such that the absolutely continuous component is concentrated on a $d_a$-dimensional subspace. We study the problem under much milder assumptions: in addition to the standard Lipschitz and margin conditions, we introduce a novel characteristic of the absolutely continuous component, by which the exact convergence rate of the classification error probability is calculated, both for the binary and for the multi-label cases. Interestingly, this rate of convergence depends only on the intrinsic dimension $d_a$. The privacy constraints mean that the data $(X_1,Y_1), \dots ,(X_n,Y_n)$ cannot be directly observed, and the classifiers are functions of the randomised outcome of a suitable local differential privacy mechanism. The statistician is free to choose the form of this privacy mechanism, and here we add Laplace distributed noise to the discretisations of all possible locations of the feature vector $X_i$ and to its label $Y_i$. Again, tight upper bounds on the rate of convergence of the classification error probability are derived, without the strong density assumption, such that this rate depends on $2\,d_a$.
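To make the setting concrete, the following is a minimal sketch of a partitioning classifier with an optional Laplace-noise release step. It is an illustration under stated assumptions, not the mechanism analysed in the paper: it assumes binary labels in $\{-1,+1\}$, features in $[0,1)^d$, and a cubic partition, and it perturbs only the label-weighted cell indicators of each sample. The names `cell_index`, `laplace`, `fit_partition_votes`, `classify`, `cells_per_axis` and `noise_scale` are all hypothetical, and calibrating `noise_scale` to a given privacy level is not shown.

```python
import itertools
import random


def cell_index(x, cells_per_axis):
    """Map a point in [0, 1)^d to the index tuple of its cubic cell."""
    return tuple(
        min(int(coord * cells_per_axis), cells_per_axis - 1) for coord in x
    )


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def fit_partition_votes(samples, cells_per_axis, noise_scale=0.0, rng=None):
    """Sum the (optionally noisy) signed labels falling in each cell.

    samples: list of (x, y) pairs, x a tuple in [0, 1)^d, y in {-1, +1}.
    With noise_scale == 0 this is the plain majority-vote statistic.
    With noise_scale > 0, every sample contributes an independent Laplace
    perturbation to EVERY cell, so the released vector does not reveal
    which cell the sample actually fell into (assumed form of the release).
    """
    rng = rng or random.Random(0)
    dim = len(samples[0][0])
    cells = list(itertools.product(range(cells_per_axis), repeat=dim))
    votes = {cell: 0.0 for cell in cells}
    for x, y in samples:
        home = cell_index(x, cells_per_axis)
        for cell in cells:
            signal = float(y) if cell == home else 0.0
            noise = laplace(noise_scale, rng) if noise_scale > 0 else 0.0
            votes[cell] += signal + noise
    return votes


def classify(votes, x, cells_per_axis):
    """Predict +1 if the vote total in x's cell is non-negative, else -1."""
    return 1 if votes[cell_index(x, cells_per_axis)] >= 0 else -1
```

In this sketch the non-private and privatised classifiers share the same decision rule and differ only in whether the per-cell vote totals are perturbed. The number of cells grows as `cells_per_axis ** d`, which is why the sketch is practical only for small $d$.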