The increasing scale and complexity of modern models underscore the importance of pre-trained parameters. However, deployment often demands architectures of varying sizes, exposing the limitations of the conventional pre-train-then-fine-tune paradigm. To address this, we propose SWEET, a self-supervised framework that performs constraint-based pre-training to enable scalable initialization for vision tasks. Instead of pre-training a fixed-size model, we learn a shared weight template and size-specific weight scalers under a Tucker-based factorization, which promotes modularity and supports flexible adaptation to architectures of varying depth and width. Target models are then initialized by composing and reweighting the template through lightweight weight scalers, whose parameters can be learned efficiently from minimal training data. To further improve flexibility in width expansion, we introduce width-wise stochastic scaling, which regularizes the template along width-related dimensions and encourages robust, width-invariant representations for better cross-width generalization. Extensive experiments on \textsc{classification}, \textsc{detection}, \textsc{segmentation}, and \textsc{generation} tasks demonstrate the state-of-the-art performance of SWEET in initializing variable-sized vision models.
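The core idea of composing a target model's weights from a shared template via size-specific scalers can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the template dimensions, the scaler shapes, and the function `init_layer` are all assumptions, and the scalers are random here, whereas in SWEET they would be learned from a small amount of training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared weight template (the Tucker core), learned once
# during pre-training and reused across target architectures.
r_out, r_in = 64, 64
G = rng.standard_normal((r_out, r_in))

def init_layer(d_out, d_in):
    """Compose a (d_out, d_in) weight from the shared template G.

    A and B play the role of lightweight, size-specific weight scalers
    applied as mode products of a Tucker-2 factorization:
        W = G x1 A x2 B  =  A @ G @ B.T
    They are random here purely for illustration.
    """
    A = rng.standard_normal((d_out, r_out)) / np.sqrt(r_out)
    B = rng.standard_normal((d_in, r_in)) / np.sqrt(r_in)
    return A @ G @ B.T

# The same template initializes layers of different widths.
W_small = init_layer(192, 192)
W_large = init_layer(384, 384)
print(W_small.shape, W_large.shape)  # (192, 192) (384, 384)
```

Because only the small scaler matrices `A` and `B` depend on the target size, adapting to a new width or depth requires learning far fewer parameters than pre-training the target model from scratch.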