Cabrita: closing the gap for foreign languages

Training a model from scratch in a specific language or domain serves two essential purposes: i) improving performance in that linguistic or domain context, and ii) ensuring effective tokenization. The main limitation of this approach is its cost, which can reach six- to seven-figure dollar amounts depending on the size of the model. The usual way around the cost is to rely on available pre-trained models. Despite recent advances such as LLaMA and LLaMA-2, however, these models remain inefficient for certain domain-specific problems and prove ineffective in scenarios involving conversational memory, because they need a large number of tokens to represent the text. To overcome this issue, we present a methodology named Cabrita, which, as our research demonstrates, addresses both the performance problem and the tokenization-efficiency problem at an affordable cost. We believe that this methodology can be applied to any transformer-like architecture. To validate the study, we conducted continuous pre-training exclusively on Portuguese text, starting from a 3-billion-parameter model known as OpenLLaMA, which resulted in a model named openCabrita 3B. openCabrita 3B also features a new tokenizer that significantly reduces the number of tokens required to represent the text. In our assessment on few-shot learning tasks, this 3B model achieved results similar to those of a traditional continuous pre-training approach and to those of 7B English pre-trained models.
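The tokenization argument above can be made concrete with a toy measurement of tokens per word. The sketch below is illustrative only: it uses a simple greedy longest-match segmenter rather than the tokenizer actually used by OpenLLaMA or openCabrita, and the two vocabularies, the sample sentence, and the function names are hypothetical. It shows only the general effect that a vocabulary containing Portuguese subwords needs fewer tokens for Portuguese text.

```python
from typing import List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation.

    Characters not covered by any vocabulary entry fall back to
    single-character tokens, which mimics how an ill-suited vocabulary
    fragments text from another language.
    """
    tokens: List[str] = []
    max_len = max((len(v) for v in vocab), default=1)
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def tokens_per_word(text: str, vocab: Set[str]) -> float:
    """Average number of tokens needed per whitespace-separated word."""
    words = text.split()
    total = sum(len(greedy_tokenize(word, vocab)) for word in words)
    return total / len(words)


# Hypothetical toy vocabularies, not taken from any real model.
ENGLISH_VOCAB = {"the", "ing", "tion", "pre", "model", "to", "en"}
PORTUGUESE_VOCAB = ENGLISH_VOCAB | {
    "ção", "modelo", "língua", "portugu", "esa", "para", "a",
}

SAMPLE = "modelo para a língua portuguesa"

if __name__ == "__main__":
    print("English-centric vocab:", tokens_per_word(SAMPLE, ENGLISH_VOCAB))
    print("Portuguese-adapted vocab:", tokens_per_word(SAMPLE, PORTUGUESE_VOCAB))
```

A lower tokens-per-word ratio means the same text occupies less of the context window, which is why tokenizer efficiency matters for the conversational-memory scenarios mentioned above.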

