Language models have rapidly evolved, predominantly focusing on English while often neglecting extensive pretraining in other languages. This gap has prompted initiatives to adapt powerful, English-centric models to other linguistic contexts through finetuning. For Dutch, one such recent endeavour is ``GEITje'', a model originally derived from the English-based Mistral 7B. Building on this foundational work, the current research extends the capabilities of GEITje through supervised finetuning on newly created, high-quality synthetic conversational datasets, along with an additional preference alignment procedure on a synthetic feedback dataset. Both the developed models and the created datasets are openly available.