This study introduces a novel approach for generating high-quality, language-specific chat corpora using a self-chat mechanism. We combine a generator LLM, which creates new samples, with an embedder LLM, which ensures diversity among them. We also propose a new quality assessment metric, based on a Masked Language Modelling (MLM) model, for evaluating and filtering the corpora. Using llama2-70b as the generator and a multilingual sentence transformer as the embedder, we generate an Italian chat corpus and refine the Fauno corpus, which is based on translated English ChatGPT self-chat data. The refinement relies on structural assertions and Natural Language Processing techniques. Both corpora undergo a comprehensive quality evaluation with the proposed MLM-based metric. An Italian LLM fine-tuned on these corpora demonstrates significantly enhanced language comprehension and question-answering skills. The resulting model, cerbero-7b, establishes a new state of the art for Italian LLMs. This approach marks a substantial advancement in the development of language-specific LLMs, with a special emphasis on augmenting corpora for underrepresented languages such as Italian.
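The two filtering ideas named above, embedding-based diversity and MLM-based quality scoring, can be illustrated with a minimal sketch. The abstract does not specify how either step is computed, so everything here is an assumption: `toy_embed` is a bag-of-words stand-in for the multilingual sentence transformer, the threshold `max_similarity` is arbitrary, and `mlm_quality` assumes a pseudo-log-likelihood-style score in which each token is masked in turn and scored by a caller-supplied function `masked_log_prob` (a hypothetical hook for a real MLM).

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedder (assumption): lowercase bag-of-words counts.

    A real pipeline would call a multilingual sentence transformer here.
    """
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_diverse(
    candidates: Sequence[str],
    embed: Callable[[str], Vector] = toy_embed,
    max_similarity: float = 0.8,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Greedily keep a candidate only if it is not too similar to any kept one."""
    kept: List[str] = []
    kept_vectors: List[Vector] = []
    for text in candidates:
        vector = embed(text)
        if all(cosine(vector, other) <= max_similarity for other in kept_vectors):
            kept.append(text)
            kept_vectors.append(vector)
    return kept


def mlm_quality(
    tokens: Sequence[str],
    masked_log_prob: Callable[[Sequence[str], int], float],
) -> float:
    """Mean log-probability of each token when that token alone is masked.

    `masked_log_prob(tokens, i)` is a hypothetical hook that should return
    log P(tokens[i] | tokens with position i masked) from an MLM.
    """
    if not tokens:
        return float("-inf")
    total = sum(masked_log_prob(tokens, i) for i in range(len(tokens)))
    return total / len(tokens)


def filter_by_quality(
    samples: Sequence[Sequence[str]],
    masked_log_prob: Callable[[Sequence[str], int], float],
    min_score: float,
) -> List[Sequence[str]]:
    """Keep only samples whose MLM-based score reaches `min_score`."""
    return [s for s in samples if mlm_quality(s, masked_log_prob) >= min_score]
```

In use, candidates produced by the generator would pass through `select_diverse` before being added to the corpus, and `filter_by_quality` would then discard low-scoring samples. Injecting the embedder and the MLM scorer as callables keeps the sketch independent of any particular model library.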