In this work, we present a comprehensive exploration of finetuning Malaysian language models, specifically Llama2 and Mistral, on embedding tasks involving negative and positive pairs. We release two distinct models, one tailored for Semantic Similarity and one for Retrieval-Augmented Generation (RAG). For Semantic Similarity, our 600-million-parameter Llama2 model outperforms OpenAI text-embedding-ada-002 across all recall@k metrics on the b.cari.com.my, c.cari.com.my, Malay news, and Malaysian Twitter test sets. For RAG, our approach is competitive with OpenAI text-embedding-ada-002 in the Malaysian context. Notably, our 2-billion-parameter Llama2 model achieves superior Recall@5 and Recall@10 on the "Melayu" keyword research papers dataset, and superior Recall@3, Recall@5, and Recall@10 on the lom.agc.gov.my dataset. These findings underscore the effectiveness of our finetuning strategy and highlight the performance gains in both Semantic Similarity and RAG tasks. All models are released at https://huggingface.co/collections/mesolitica/malaysian-embedding-6523612bfe5881ad35f81b99
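The recall@k metric reported above can be illustrated with a minimal sketch. The abstract does not specify the exact evaluation procedure, so this assumes one gold document per query and ranking by cosine similarity; the function name `recall_at_k` and the toy vectors are hypothetical and do not come from the released models or test sets.

```python
import numpy as np

def recall_at_k(query_emb, doc_emb, gold, k):
    """Fraction of queries whose gold document appears in the top-k results.

    Assumption: exactly one gold document index per query, and documents
    are ranked by cosine similarity to the query embedding.
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sims = q @ d.T                              # shape: (n_queries, n_docs)
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of the k most similar docs
    hits = (topk == np.asarray(gold)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy embeddings standing in for model outputs (illustrative only).
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[0.9, 0.1], [0.6, 0.8]])
gold = [0, 1]  # index of the relevant document for each query

print(recall_at_k(queries, docs, gold, k=1))
print(recall_at_k(queries, docs, gold, k=2))
```

In this toy case the second query ranks a distractor document first and its gold document second, so recall rises when k grows from 1 to 2. Comparing two embedding models under this metric amounts to computing it with each model's embeddings for the same queries, documents, and gold labels.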