Training a model from scratch in a specific language or domain serves two purposes: (i) improving performance in that linguistic or domain context, and (ii) ensuring effective tokenization. The main limitation of this approach is its cost, which can reach six- to seven-digit dollar values depending on the size of the model. The usual way to avoid that cost is to rely on available pre-trained models. However, despite recent advances such as LLaMA and LLaMA-2, these models remain inefficient for some domain-specific problems and are ineffective in scenarios that rely on conversational memory, because they need a large number of tokens to represent the text. To address this issue, we present a methodology named Cabrita, which, as our research demonstrates, addresses both the performance and the tokenization-efficiency problems at an affordable cost. We believe this methodology can be applied to any transformer-like architecture. To validate the study, we performed continuous pre-training exclusively on Portuguese text, starting from a 3-billion-parameter model known as OpenLLaMA, resulting in a model named openCabrita 3B. openCabrita 3B also features a new tokenizer that significantly reduces the number of tokens required to represent the text. In our assessment on few-shot learning tasks, this 3B model achieved results similar to those of a traditional continuous pre-training approach and of 7B English pre-trained models.
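To make the tokenization-efficiency claim concrete, the following is a minimal sketch of how the number of tokens per word might be compared between two vocabularies. It is not the tokenizer used for openCabrita 3B: the greedy longest-match tokenizer, the two toy vocabularies, and the sample sentence are all hypothetical and chosen only for illustration. The idea is that a vocabulary containing Portuguese subwords should need fewer tokens for the same Portuguese text.

```python
from typing import Callable, Iterable, List


def greedy_tokenize(text: str, vocab: Iterable[str]) -> List[str]:
    """Toy greedy longest-match tokenizer (hypothetical, for illustration).

    Pieces not covered by the vocabulary fall back to single characters.
    """
    vocab_set = set(vocab)
    max_len = max((len(v) for v in vocab_set), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found;
        # a single character is always accepted as the fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab_set:
                tokens.append(piece)
                i += size
                break
    return tokens


def tokens_per_word(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return len(tokenize(text)) / len(words)


# Hypothetical vocabularies: one English-centric, one extended with
# Portuguese subwords. Neither reflects a real model's vocabulary.
ENGLISH_VOCAB = ["the", "ing", "tion", "to", "ken", "re", "co", "st"]
PORTUGUESE_VOCAB = ENGLISH_VOCAB + ["ção", "iza", "eficiente", "reduz", "custo"]

SAMPLE = "a tokenização eficiente reduz o custo"

english_rate = tokens_per_word(SAMPLE, lambda t: greedy_tokenize(t, ENGLISH_VOCAB))
portuguese_rate = tokens_per_word(SAMPLE, lambda t: greedy_tokenize(t, PORTUGUESE_VOCAB))

print(f"English-centric vocab:    {english_rate:.2f} tokens/word")
print(f"Portuguese-adapted vocab: {portuguese_rate:.2f} tokens/word")
```

In this toy setting the adapted vocabulary yields a lower tokens-per-word ratio, which is the kind of reduction that makes longer conversational context fit into a fixed token budget. With a real tokenizer, the same ratio would be computed over a representative corpus rather than a single sentence.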