Kolmogorov--Arnold Networks (KANs) have recently emerged as a structured alternative to standard MLPs, yet a principled theory of their training dynamics, generalization, and privacy properties remains limited. In this paper, we analyze gradient descent (GD) for training two-layer KANs and derive general bounds that characterize their training dynamics, generalization, and utility under differential privacy (DP). As a concrete instantiation, we specialize our analysis to the logistic loss under an NTK-separability assumption, where we show that polylogarithmic network width suffices for GD to achieve an optimization rate of order $1/T$ and a generalization rate of order $1/n$, with $T$ denoting the number of GD iterations and $n$ the sample size. In the private setting, we characterize the noise required for $(\varepsilon,\delta)$-DP and obtain a utility bound of order $\sqrt{d}/(n\varepsilon)$ (with $d$ the input dimension), matching the classical lower bound for general convex Lipschitz problems. Our results imply that polylogarithmic width is not only sufficient but also necessary under differential privacy, revealing a qualitative gap between the non-private regime, where polylogarithmic width is merely sufficient, and the private regime, where it also becomes necessary. Experiments further illustrate how these theoretical insights can guide practical choices, including network width selection and early stopping.
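To fix intuition for the architecture the analysis targets, the following is a minimal sketch of a two-layer KAN forward pass: each edge carries a learnable univariate function $\phi(x) = \sum_k c_k b_k(x)$, here parameterized over a simple fixed polynomial basis for illustration (practical KANs typically use B-spline bases); the class name, basis choice, and initialization scale are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def basis(x, K=4):
    # Fixed univariate basis evaluated elementwise; a polynomial basis is
    # used here only for simplicity (assumption; KANs usually use B-splines).
    return np.stack([x**k for k in range(K)], axis=-1)  # shape (..., K)

class TwoLayerKAN:
    """Minimal two-layer KAN sketch: every edge (i -> j) applies its own
    learnable univariate function, and node j sums its incoming edges."""

    def __init__(self, d_in, d_hidden, d_out, K=4, seed=0):
        rng = np.random.default_rng(seed)
        self.K = K
        # Edge-function coefficients, layer 1: one (K,)-vector per edge.
        self.C1 = rng.normal(scale=0.1, size=(d_hidden, d_in, K))
        # Edge-function coefficients, layer 2.
        self.C2 = rng.normal(scale=0.1, size=(d_out, d_hidden, K))

    def forward(self, x):
        # x: (n, d_in) batch of inputs.
        B1 = basis(x, self.K)                      # (n, d_in, K)
        h = np.einsum('nik,jik->nj', B1, self.C1)  # sum edge functions into nodes
        B2 = basis(h, self.K)                      # (n, d_hidden, K)
        return np.einsum('njk,ojk->no', B2, self.C2)

kan = TwoLayerKAN(d_in=3, d_hidden=8, d_out=1)
y = kan.forward(np.zeros((5, 3)))  # batch of 5 inputs -> shape (5, 1)
```

In the setting of the abstract, GD would be run on the coefficient tensors `C1` and `C2`, with Gaussian noise added to clipped gradients in the private case.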