We introduce a new nonparametric framework for classification problems in the presence of missing data. The key aspect of our framework is that the regression function decomposes into an ANOVA-type sum of orthogonal functions, of which some (or even many) may be zero. Working under a general missingness setting, which allows features to be missing not at random, our main goal is to derive the minimax rate for the excess risk in this problem. In addition to the decomposition property, the rate depends on parameters that control the tail behaviour of the marginal feature distributions, the smoothness of the regression function and a margin condition. The ambient data dimension does not appear in the minimax rate, which can therefore be faster than in the classical nonparametric setting. We further propose a new method, called the Hard-thresholding Anova Missing data (HAM) classifier, based on a careful combination of a $k$-nearest neighbour algorithm and a thresholding step. The HAM classifier attains the minimax rate up to polylogarithmic factors, and numerical experiments illustrate its utility.