This paper proposes the task of automatic assessment of Sentence Translation Exercises (STEs), which are commonly used in the early stages of L2 language learning. We formalize the task as grading each student response against every rubric criterion pre-specified by educators. We then create a dataset of STEs between Japanese and English comprising 21 questions and a total of 3,498 student responses (about 167 per question on average). The responses were collected from students and crowd workers. Using this dataset, we evaluate baselines including a finetuned BERT model and GPT models with few-shot in-context learning. Experimental results show that the finetuned BERT baseline classifies correct responses with an F1 score of approximately 90%, but reaches less than 80% for incorrect responses. Furthermore, the GPT models with few-shot learning perform worse than finetuned BERT, indicating that the proposed task remains challenging even for state-of-the-art large language models.