Data is a cornerstone of fine-tuning large language models, yet acquiring suitable data remains challenging. The challenges include data scarcity, linguistic diversity, and the need for domain-specific content. This paper presents lessons learned while crawling and refining data tailored for fine-tuning Vietnamese language models. Crafting such a dataset demands meticulous planning: it must account for linguistic intricacies and strike a balance between inclusivity and accuracy. We present a multidimensional strategy that includes leveraging existing English-language datasets and developing customized data-crawling scripts with the assistance of generative AI tools. An LLM fine-tuned for Vietnamese on the resulting datasets demonstrated good performance when generating Vietnamese news articles from prompts. The study offers practical solutions and guidance for future fine-tuning of models in languages like Vietnamese.