We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data, and tuning of all other components. We empirically demonstrate that this training recipe also scales well with more compute, achieving results on par with leading SLMs at a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform the predicted compute-optimal performance, giving an optimistic view of SLM feasibility. Code, data, models, and samples are available at https://pages.cs.huji.ac.il/adiyoss-lab/slamming.