Kolmogorov--Arnold Networks (KANs) have recently emerged as a structured alternative to standard MLPs, yet a principled theory of their training dynamics, generalization, and privacy properties remains limited. In this paper, we analyze gradient descent (GD) for training two-layer KANs and derive general bounds that characterize their training dynamics, generalization, and utility under differential privacy (DP). As a concrete instantiation, we specialize our analysis to the logistic loss under an NTK-separability assumption, and show that polylogarithmic network width suffices for GD to achieve an optimization rate of order $1/T$ and a generalization rate of order $1/n$, where $T$ denotes the number of GD iterations and $n$ the sample size. In the private setting, we characterize the noise required for $(\epsilon,\delta)$-DP and obtain a utility bound of order $\sqrt{d}/(n\epsilon)$, with $d$ the input dimension, matching the classical lower bound for general convex Lipschitz problems. Our results imply that under differential privacy, polylogarithmic width is not only sufficient but also necessary, revealing a qualitative gap between the non-private regime, where such width is merely sufficient, and the private regime, where it becomes necessary as well. Experiments further illustrate how these theoretical insights can guide practical choices, such as network width selection and early stopping.
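For orientation, the private setting can be read against the textbook noisy full-batch GD template with the Gaussian mechanism; the calibration below is the standard one for a $G$-Lipschitz loss under advanced composition, shown only as an illustrative sketch rather than the exact mechanism analyzed in this paper:
\[
w_{t+1} \;=\; w_t - \eta\,\bigl(\nabla \hat{L}_n(w_t) + z_t\bigr),
\qquad
z_t \sim \mathcal{N}\!\bigl(0,\sigma^2 I_d\bigr),
\qquad
\sigma \;=\; O\!\left(\frac{G\sqrt{T\log(1/\delta)}}{n\epsilon}\right),
\]
where $\hat{L}_n$ denotes the empirical risk over the $n$ samples and $G$ bounds the per-sample gradient norm; noise at this scale is the level at which utility bounds of order $\sqrt{d}/(n\epsilon)$ typically arise for convex Lipschitz objectives.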