When students make a mistake on an exercise, they can consolidate their understanding by practicing ``similar exercises'' that share the same concepts, purposes, and methods. For a given subject and study stage, the exercise bank typically contains millions to tens of millions of items, so finding similar exercises for a given exercise becomes a crucial technical problem. One option is to assign a variety of explicit labels to each exercise and then query by label, but label annotation is time-consuming, laborious, and costly, and its precision and granularity are limited, so this approach is not feasible. In practice, we frame finding ``similar exercises'' as a retrieval process built on recall, ranking, and re-ranking procedures, which we call the \textbf{FSE} (Finding Similar Exercises) problem. We obtain a comprehensive representation of the semantic information of exercises through representation learning. Beyond a reasonable architecture, we also explore which pre-training and supervised learning tasks are most conducive to learning exercise semantics. Similar exercises are difficult to annotate, and annotation consistency among experts is low; this paper therefore also provides solutions to the problem of low-quality annotated data. Compared with other methods, our approach shows clear advantages in both architectural rationality and algorithmic precision, and it now serves the daily teaching of hundreds of schools.