Alloprof: a new French question-answer education dataset and its use in an information retrieval case study

Teachers and students increasingly rely on online learning resources to supplement those provided in school. The growing breadth and depth of available resources benefits students only if they can find answers to their queries. Question-answering and information retrieval systems have benefited from public datasets to train and evaluate their algorithms, but most of these datasets consist of English text written by and for adults. We introduce a new public French question-answering dataset collected from Alloprof, a Quebec-based primary and high-school help website, containing 29,349 questions and their explanations in a variety of school subjects from 10,368 students, with more than half of the explanations containing links to other questions or to some of the 2,596 reference pages on the website. We also present a case study of this dataset in an information retrieval task. The dataset was collected on the Alloprof public forum, with all questions verified for appropriateness and all explanations verified for both appropriateness and relevance to the question. To predict relevant documents, architectures using pre-trained BERT models were fine-tuned and evaluated. This dataset will allow researchers to develop question-answering, information retrieval and other algorithms specifically for the French-speaking education context. Furthermore, the range of language proficiency, images, mathematical symbols and spelling mistakes will necessitate algorithms based on multimodal comprehension. The case study we present as a baseline shows that an approach relying on recent techniques provides an acceptable level of performance, but more work is necessary before it can be reliably used and trusted in a production setting.
