Alloprof: a new French question-answer education dataset and its use in an information retrieval case study

Teachers and students increasingly rely on online learning resources to supplement those provided in school. The growing breadth and depth of available resources benefits students only if they can find answers to their queries. Question-answering and information retrieval systems have benefited from public datasets for training and evaluating their algorithms, but most of these datasets consist of English text written by and for adults. We introduce a new public French question-answering dataset collected from Alloprof, a Quebec-based primary and high-school help website. It contains 29,349 questions and their explanations in a variety of school subjects from 10,368 students, and more than half of the explanations contain links to other questions or to some of the 2,596 reference pages on the website. The dataset was collected from the Alloprof public forum, with all questions verified for appropriateness and all explanations verified for both appropriateness and relevance to the question. We also present a case study of this dataset in an information retrieval task: to predict relevant documents, architectures using pre-trained BERT models were fine-tuned and evaluated. This dataset will allow researchers to develop question-answering, information retrieval and other algorithms specifically for the French-speaking education context. Furthermore, the range of language proficiency, images, mathematical symbols and spelling mistakes in the data will require algorithms capable of multimodal comprehension. The case study, which we present as a baseline, shows that an approach relying on recent techniques provides an acceptable level of performance, but more work is necessary before it can be reliably used and trusted in a production setting.
