We revisit data selection in a modern context of finetuning from a fundamental perspective. Extending the classical wisdom of variance minimization in low dimensions to high-dimensional finetuning, our generalization analysis unveils the importance of additionally reducing bias induced by low-rank approximation. Inspired by the variance-bias tradeoff in high dimensions from the theory, we introduce Sketchy Moment Matching (SkMM), a scalable data selection scheme with two stages. (i) First, the bias is controlled using gradient sketching that explores the finetuning parameter space for an informative low-dimensional subspace $\mathcal{S}$; (ii) then the variance is reduced over $\mathcal{S}$ via moment matching between the original and selected datasets. Theoretically, we show that gradient sketching is fast and provably accurate: selecting $n$ samples by reducing variance over $\mathcal{S}$ preserves the fast-rate generalization $O(\dim(\mathcal{S})/n)$, independent of the parameter dimension. Empirically, we concretize the variance-bias balance via synthetic experiments and demonstrate the effectiveness of SkMM for finetuning in real vision tasks.
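The two-stage procedure described above can be sketched in a few lines. The following is a minimal illustrative implementation, not the authors' released code: it assumes per-sample finetuning gradients are available as a matrix, uses a plain Gaussian random projection for stage (i), and a simple greedy first/second-moment matcher for stage (ii).

```python
import numpy as np

def skmm_select(grads, n_select, sketch_dim=32, seed=0):
    """Illustrative two-stage SkMM sketch (assumptions, not the paper's code).

    grads: (N, D) per-sample finetuning gradients.
    Stage (i): Gaussian sketching maps gradients into a random
    low-dimensional subspace S (controls low-rank-approximation bias).
    Stage (ii): greedy moment matching picks n_select samples whose
    first and second moments over S track those of the full dataset
    (reduces variance over S).
    """
    rng = np.random.default_rng(seed)
    N, D = grads.shape
    # Stage (i): random Gaussian sketch, D -> sketch_dim
    S = rng.standard_normal((D, sketch_dim)) / np.sqrt(sketch_dim)
    Z = grads @ S  # (N, sketch_dim) sketched gradients

    # Target moments of the full dataset over the sketched subspace
    mu = Z.mean(axis=0)        # first moment
    Sigma = (Z.T @ Z) / N      # uncentered second moment

    selected = []
    sum_z = np.zeros(sketch_dim)
    sum_zz = np.zeros((sketch_dim, sketch_dim))
    remaining = set(range(N))
    for t in range(1, n_select + 1):
        best, best_err = None, np.inf
        for i in remaining:
            # Moments of the candidate subset if sample i were added
            m1 = (sum_z + Z[i]) / t
            m2 = (sum_zz + np.outer(Z[i], Z[i])) / t
            err = np.sum((m1 - mu) ** 2) + np.sum((m2 - Sigma) ** 2)
            if err < best_err:
                best, best_err = i, err
        selected.append(best)
        sum_z += Z[best]
        sum_zz += np.outer(Z[best], Z[best])
        remaining.remove(best)
    return np.array(selected)
```

Note that the greedy matcher above is quadratic in the dataset size and serves only to make the variance-bias structure concrete; the point of the sketching stage is that all moment computations happen in `sketch_dim` dimensions, independent of the parameter dimension `D`.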