In this paper, we introduce the French-YMCA corpus, a new linguistic resource specifically tailored for children and adolescents. The motivation for building this corpus is clear: children have unique language requirements, as their language skills are constantly evolving and differ from those of adults. The French-YMCA corpus comprises 39,200 text files totaling 22,471,898 words. It distinguishes itself through its diverse sources, its consistent grammar and spelling, and its commitment to open online access for all. Such a corpus can serve as the foundation for training language models that understand and anticipate young people's language, thereby enhancing the quality of digital interactions and ensuring that responses and suggestions are age-appropriate and adapted to the comprehension level of users in this age group.