In this paper we revisit the classical method of partitioning classification and prove novel convergence rates under relaxed conditions, for both observable (non-privatised) and privatised data. We consider the problem of classification in a $d$-dimensional Euclidean space. Previous results on the partitioning classifier relied on the strong density assumption (SDA), which is restrictive, as we demonstrate through simple examples. Here, we study the problem under much milder assumptions. We assume that the distribution of the inputs is a mixture of an absolutely continuous and a discrete distribution, such that the absolutely continuous component is concentrated on a $d_a$-dimensional subspace. In addition to the standard Lipschitz and margin conditions, we introduce a novel characteristic of the absolutely continuous component, in terms of which the convergence rate of the classification error probability is derived, for both the binary and the multi-class cases. This bound can attain the minimax optimal convergence rate achievable under SDA, but under much milder distributional assumptions. Interestingly, this convergence rate depends only on the intrinsic dimension of the continuous inputs, $d_a$, and not on $d$. Under privacy constraints, the data cannot be observed directly, and the constructed classifiers are functions of the randomised output of a suitable local differential privacy mechanism. Specifically, we add Laplace-distributed noise to the discretisations of all possible locations of the feature vector and to its label. Again, tight upper bounds on the convergence rate of the classification error probability can be derived without SDA, and this rate depends on $2d_a$.
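As a rough illustration of the two estimators described in the abstract, the following Python sketch implements a binary partitioning classifier on a cubic partition of $[0,1)^d$, together with a locally privatised variant. Everything named here is an assumption made for illustration and is not taken from the paper: the function names, the cell width `h`, labels in $\{-1,+1\}$, and the choice to release only the label-weighted cell indicators with Laplace scale $2/\varepsilon$ (the $\ell_1$ sensitivity of that vector is 2). The paper's actual mechanism and parameter choices may differ.

```python
import numpy as np


def cell_index(X, h):
    """Flat index of the cubic cell (side h) containing each row of X in [0,1)^d."""
    d = X.shape[1]
    m = int(np.ceil(1.0 / h))  # number of cells per axis
    idx = np.clip(np.floor(X / h).astype(int), 0, m - 1)
    flat = np.ravel_multi_index(tuple(idx.T), (m,) * d)
    return flat, m ** d


def fit_partition(X, y, h):
    """Non-private rule: sum of the +/-1 labels falling in each cell."""
    cells, n_cells = cell_index(X, h)
    return np.bincount(cells, weights=y, minlength=n_cells)


def fit_private_partition(X, y, h, eps, rng):
    """Locally private rule (illustrative assumption, not the paper's exact scheme).

    Each data holder i releases the vector (y_i * 1{X_i in cell j})_j plus
    independent Laplace noise of scale 2 / eps on every coordinate; the
    aggregator only sums the noisy vectors.
    """
    cells, n_cells = cell_index(X, h)
    n = len(y)
    Z = np.zeros((n, n_cells))
    Z[np.arange(n), cells] = y                    # label-weighted one-hot encoding
    Z += rng.laplace(scale=2.0 / eps, size=Z.shape)  # noise added per individual
    return Z.sum(axis=0)


def predict(votes, X, h):
    """Classify by the sign of the (possibly noisy) vote in the query's cell."""
    cells, _ = cell_index(X, h)
    return np.where(votes[cells] >= 0, 1, -1)
```

Because every individual perturbs all cells of the partition, the accumulated noise grows with the number of cells, so a privatised rule would typically need a coarser partition than the non-private one. This is only an intuition for why the private rate is worse, not a derivation of the bounds stated in the abstract.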