Recently, significant public effort has been directed towards developing low-cost models with capabilities akin to ChatGPT, fostering the growth of open-source conversational models. However, comprehensive and in-depth evaluations of these models' performance remain scarce. In this study, we examine how training data factors, including quantity, quality, and linguistic distribution, influence model performance. Our analysis is grounded in several publicly accessible, high-quality instruction datasets, as well as our own Chinese multi-turn conversations. We assess various models using an evaluation set of 1,000 samples covering nine real-world scenarios. Our goal is to supplement manual evaluations with quantitative analyses, offering insights for the continued advancement of open-source chat models. Furthermore, to improve both the performance and the training and inference efficiency of models in the Chinese domain, we extend the vocabulary of LLaMA, the open-source model whose performance comes closest to that of proprietary language models such as GPT-3, and conduct secondary pre-training on 3.4B Chinese words. We make our model, data, and code publicly available.