Deep neural networks (DNNs) trained with the logistic loss (i.e., the cross-entropy loss) have made impressive advances in various binary classification tasks. However, generalization analysis for binary classification with DNNs and the logistic loss remains scarce. The main obstacle to deriving satisfactory generalization bounds is that the target function for the logistic loss is unbounded. In this paper, we aim to fill this gap by establishing a novel and elegant oracle-type inequality, which enables us to deal with the boundedness restriction on the target function, and by using it to derive sharp convergence rates for fully connected ReLU DNN classifiers trained with the logistic loss. In particular, we obtain optimal convergence rates (up to log factors) requiring only the H\"older smoothness of the conditional class probability $\eta$ of the data. Moreover, we consider a compositional assumption that requires $\eta$ to be the composition of several vector-valued functions, each component function of which is either a maximum value function or a H\"older smooth function depending only on a small number of its input variables. Under this assumption, we derive optimal convergence rates (up to log factors) that are independent of the input dimension of the data. This result explains why DNN classifiers can perform well in practical high-dimensional classification problems. Besides the novel oracle-type inequality, the sharp convergence rates given in our paper also rely on a tight error bound for approximating the natural logarithm function near zero (where it is unbounded) by ReLU DNNs. In addition, we justify our claims of optimality by proving corresponding minimax lower bounds. All these results are new in the literature and will deepen our theoretical understanding of classification with DNNs.
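To make the unboundedness obstacle concrete, the following is a brief sketch of the standard calculation behind it. It is not taken from the paper: the label convention $Y \in \{-1, 1\}$ and the symbols $\phi$ and $f^*$ are assumptions introduced here for illustration only.

```latex
% Assumed notation (not from the source): labels Y in {-1, 1},
% logistic loss \phi, conditional class probability \eta(x) = P(Y = 1 | X = x).
\[
  \phi(t) = \log\bigl(1 + e^{-t}\bigr), \qquad
  \mathbb{E}\bigl[\phi(Y f(X)) \,\big|\, X = x\bigr]
    = \eta(x)\,\log\bigl(1 + e^{-f(x)}\bigr)
    + \bigl(1 - \eta(x)\bigr)\,\log\bigl(1 + e^{f(x)}\bigr).
\]
% Setting the derivative with respect to f(x) to zero gives
% e^{f(x)} = \eta(x) / (1 - \eta(x)), so the pointwise minimizer is
\[
  f^*(x) = \log\frac{\eta(x)}{1 - \eta(x)},
  \qquad |f^*(x)| \to \infty \ \text{ as } \ \eta(x) \to 0 \ \text{ or } \ \eta(x) \to 1 .
\]
```

Under these assumptions, the target function $f^*$ is unbounded wherever $\eta$ approaches $0$ or $1$. The logarithm in $f^*$ also suggests why a tight bound for approximating $\log$ near zero by ReLU DNNs is relevant.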