We decompose the Kullback--Leibler generalization error (GE) -- the expected KL divergence from the data distribution to the trained model -- of unsupervised learning into three non-negative components: model error, data bias, and variance. The decomposition is exact for any e-flat model class and follows from two identities of information geometry: the generalized Pythagorean theorem and a dual e-mixture variance identity. As an analytically tractable demonstration, we apply the framework to $\varepsilon$-PCA, a regularized principal component analysis in which the empirical covariance is truncated at rank $N_K$ and the discarded directions are pinned at a fixed noise floor $\varepsilon$. Although rank-constrained $\varepsilon$-PCA is not itself e-flat, it admits a technical reformulation with the same total GE on isotropic Gaussian data, under which each component of the decomposition takes closed form. The optimal rank emerges as the cutoff $\lambda_{\mathrm{cut}}^{*} = \varepsilon$ -- the model retains exactly those empirical eigenvalues exceeding the noise floor -- and this cutoff reflects a marginal-rate balance between model-error gain and data-bias cost. A boundary comparison further yields a three-regime phase diagram -- retain-all, interior, and collapse -- whose regimes are separated by the lower Marchenko--Pastur edge and an analytically computable collapse threshold $\varepsilon_{*}(\alpha)$, where $\alpha$ is the dimension-to-sample-size ratio. All claims are verified numerically.
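To make the $\varepsilon$-PCA construction concrete, the following minimal sketch builds the model spectrum (the largest $N_K$ empirical eigenvalues kept, the rest pinned at the noise floor) and evaluates the KL divergence from a zero-mean isotropic Gaussian to the fitted model. It is an illustration under stated assumptions, not the paper's implementation: the function names, the true variance `sigma2`, and the example eigenvalues are all hypothetical. It computes the KL divergence for a single fitted model, not the expected GE or its three components. Because the true covariance is isotropic, the standard Gaussian KL formula depends only on the model's eigenvalues, so no eigenvectors are needed.

```python
import math


def eps_pca_spectrum(empirical_eigs, n_keep, eps):
    """Keep the n_keep largest empirical eigenvalues; pin the rest at eps."""
    ordered = sorted(empirical_eigs, reverse=True)
    return ordered[:n_keep] + [eps] * (len(ordered) - n_keep)


def rank_from_cutoff(empirical_eigs, cutoff):
    """Number of empirical eigenvalues strictly above the cutoff."""
    return sum(1 for lam in empirical_eigs if lam > cutoff)


def kl_from_isotropic(model_spectrum, sigma2=1.0):
    """KL( N(0, sigma2 * I) || N(0, Sigma_model) ) for zero-mean Gaussians.

    With an isotropic true covariance the divergence reduces to a sum over
    the model eigenvalues mu_i:
        0.5 * sum_i [ sigma2 / mu_i - 1 + log(mu_i / sigma2) ].
    """
    return 0.5 * sum(
        sigma2 / mu - 1.0 + math.log(mu / sigma2) for mu in model_spectrum
    )


# Hypothetical empirical eigenvalues and noise floor, for illustration only.
eigs = [2.0, 1.2, 0.7, 0.3]
eps = 0.5

# Cutoff rule from the abstract: retain eigenvalues exceeding the noise floor.
n_keep = rank_from_cutoff(eigs, eps)
spectrum = eps_pca_spectrum(eigs, n_keep, eps)
kl_value = kl_from_isotropic(spectrum, sigma2=1.0)
print(n_keep, spectrum, kl_value)
```

Averaging `kl_from_isotropic` over many independently sampled empirical spectra would give a Monte Carlo estimate of the total GE for a given rank. That step is left out here because it requires a sampling setup the abstract does not specify.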