In this work, we introduce a new paradigm called Simulated Overparametrization (SOP). SOP combines the computational efficiency of compact models with the learning capacity of overparameterized models. Under SOP, a model with a significantly larger number of parameters is trained in such a way that only a smaller, efficient subset of those parameters is used for computation at inference time. Building on this framework, we present an architecture-agnostic algorithm called "majority kernels", which integrates with widely used architectures, including Transformer models. Majority kernels enable the simulated training of overparameterized models, yielding performance gains across architectures and tasks. Furthermore, the approach adds minimal overhead to wall-clock training time. It performs strongly across a wide variety of datasets and models, even outperforming strong baselines such as combinatorial optimization methods based on submodular optimization.
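To make the train-large, infer-small idea concrete, here is a minimal NumPy sketch for a single linear layer. It is an illustration under stated assumptions, not the majority kernels algorithm itself, whose details are not given above. The names `train_sop_linear` and `collapse`, the Dirichlet mixing of kernels during training, and the averaging rule used at inference are all hypothetical choices made for this example.

```python
import numpy as np


def train_sop_linear(x, y, num_kernels=4, steps=2000, lr=0.5, seed=0):
    """Train a bank of kernels for one linear layer (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    d_in, d_out = x.shape[1], y.shape[1]
    # Overparameterized bank: num_kernels candidate kernels for ONE layer.
    kernels = rng.normal(scale=0.1, size=(num_kernels, d_in, d_out))
    for _ in range(steps):
        # Assumption: mix the kernels with random convex weights each step.
        alpha = rng.dirichlet(np.ones(num_kernels))
        w_eff = np.tensordot(alpha, kernels, axes=1)  # shape (d_in, d_out)
        err = x @ w_eff - y
        # Gradient of 0.5 * mean squared error with respect to w_eff.
        grad_w = x.T @ err / len(x)
        # Chain rule: d(w_eff) / d(kernels[k]) = alpha[k].
        kernels -= lr * alpha[:, None, None] * grad_w
    return kernels


def collapse(kernels):
    """Assumption: inference uses one compact kernel, the mean of the bank."""
    return kernels.mean(axis=0)


# Toy regression problem with a known ground-truth kernel.
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=(256, 3))
w_true = np.array([[1.0, -2.0], [0.5, 0.0], [-1.5, 3.0]])
y = x @ w_true

kernels = train_sop_linear(x, y)   # training touches 4x the parameters
w_infer = collapse(kernels)        # inference uses a single (3, 2) kernel
print(kernels.shape, w_infer.shape)
```

In this toy setting the training-time parameter count grows with `num_kernels`, while the inference-time layer has the same size and cost as an ordinary linear layer.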