Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing

Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalised as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for the training and generalisation errors, showing how they depend on representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly influence downstream generalisation, and we characterise the optimal representation size as a function of task parameters: with abundant pretraining data but scarce downstream data, maximally compressed representations are optimal, whereas with limited pretraining data, higher-dimensional representations generalise better. Furthermore, we establish an exact trade-off between pretraining and supervision, quantifying how much unlabelled data is required to replace a single labelled sample. Beyond our idealised model, we observe similar phenomenology in autoencoders and pretrained LLMs. Altogether, our results highlight that optimising representation size is critical, and give conditions under which compression during pretraining improves generalisation.
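The two-stage setup described above (principal component analysis on unlabelled data, then linear regression on a separate labelled set) can be illustrated numerically with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the Gaussian data model with linearly decaying variances, the random teacher vector, the noise level, the default sizes, and the helper names `pca_components`, `linear_probe` and `probe_errors`.

```python
import numpy as np


def pca_components(X_unlabelled, k):
    """Return the top-k principal directions (as rows) of the unlabelled data."""
    Xc = X_unlabelled - X_unlabelled.mean(axis=0)
    # Rows of Vt are the right singular vectors, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]


def linear_probe(Z_train, y_train):
    """Least-squares weights on the k-dimensional features (minimum-norm if underdetermined)."""
    w, *_ = np.linalg.lstsq(Z_train, y_train, rcond=None)
    return w


def probe_errors(k, n_unlabelled, n_labelled, d=50, n_test=2000, noise=0.1, seed=0):
    """Train and test mean squared error of a linear probe on k PCA features.

    Hypothetical data model (an assumption of this sketch): zero-mean Gaussian
    inputs whose per-coordinate standard deviations decay linearly, and labels
    produced by a random linear teacher plus Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    scales = np.linspace(3.0, 0.5, d)
    teacher = rng.standard_normal(d) / np.sqrt(d)

    def sample(n):
        X = rng.standard_normal((n, d)) * scales
        y = X @ teacher + noise * rng.standard_normal(n)
        return X, y

    X_u, _ = sample(n_unlabelled)      # stage 1: labels are never used
    X_tr, y_tr = sample(n_labelled)    # stage 2: separate labelled set
    X_te, y_te = sample(n_test)        # fresh data to estimate generalisation

    V = pca_components(X_u, k)         # shape (k, d)
    w = linear_probe(X_tr @ V.T, y_tr)

    train_err = float(np.mean((X_tr @ V.T @ w - y_tr) ** 2))
    test_err = float(np.mean((X_te @ V.T @ w - y_te) ** 2))
    return train_err, test_err


if __name__ == "__main__":
    # Sweep the representation size at fixed (illustrative) sample sizes.
    for k in (2, 5, 10, 20, 40):
        train_err, test_err = probe_errors(k, n_unlabelled=500, n_labelled=30)
        print(f"k={k:2d}  train={train_err:.4f}  test={test_err:.4f}")
```

Sweeping `k` at different values of `n_unlabelled` and `n_labelled` gives a rough empirical view of how test error varies with representation size. This is a finite-size simulation of a toy model only, so it should not be expected to reproduce the exact high-dimensional expressions derived in the paper.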
