Large language models have emerged as versatile tools but are challenging to apply to tasks that lack large inference budgets and large in-domain training sets. This work formalizes these constraints and distinguishes four important variables: the pretraining budget (for training before the target domain is known), the specialization budget (for training after the target domain is known), the inference budget, and the in-domain training set size. Across these settings, we compare different approaches from the machine learning literature. When inference cost is the limiting factor, we find better alternatives to the standard practice of training very large vanilla transformer models. In particular, we show that hyper-networks and mixtures of experts achieve better perplexity for large pretraining budgets, while small models trained on importance-sampled datasets are attractive for large specialization budgets.