We present the Claire French Dialogue Dataset (CFDD), a resource created by members of LINAGORA Labs in the context of the OpenLLM France initiative. CFDD is a corpus of roughly 160 million words drawn from French-language transcripts and stage plays, which we have assembled and publicly released to further the development of multilingual, open-source language models. This paper describes the 24 individual corpora that make up CFDD and provides links and citations to their original sources. It also presents our proposed breakdown of the full dataset into eight categories of subcorpora and describes the process we followed to standardize the format of the final dataset. We conclude with a discussion of similar work and future directions.