Relying on human experts to evaluate CEFR speaking assessments in an e-learning environment creates scalability challenges, as it limits how quickly and widely assessments can be conducted. We aim to automate the evaluation of CEFR B2 English speaking assessments in e-learning environments from conversation transcripts. First, we evaluate the capability of leading open-source and commercial Large Language Models (LLMs) to score a candidate's performance across the various criteria of the CEFR B2 speaking exam in both global and India-specific contexts. Next, we create a new expert-validated, CEFR-aligned synthetic conversational dataset with transcripts rated at different assessment scores. In addition, new instruction-tuned datasets are developed from the English Vocabulary Profile (up to CEFR B2 level) and the CEFR-SP WikiAuto datasets. Finally, using these new datasets, we perform parameter-efficient instruction tuning of Mistral Instruct 7B v0.2 to develop a family of models called EvalYaks. Four models in this family assess the four sections of the CEFR B2 speaking exam, one identifies the CEFR level of vocabulary and generates level-specific vocabulary, and another detects the CEFR level of text and generates level-specific text. EvalYaks achieved an average acceptable accuracy of 96%, a degree of variation of 0.35 levels, and performed three times better than the next best model. This demonstrates that a 7B-parameter LLM instruction-tuned with high-quality CEFR-aligned assessment data can effectively evaluate and score CEFR B2 English speaking assessments, offering a promising solution for scalable, automated language proficiency evaluation.
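The two headline metrics above can be illustrated with a minimal sketch. Assuming "acceptable accuracy" counts predictions within a tolerance (here ±0.5 levels, an assumption) of the expert score, and "degree of variation" is the mean absolute deviation from expert ratings; the paper's exact definitions may differ:

```python
def acceptable_accuracy(pred, gold, tolerance=0.5):
    """Fraction of model scores within +/-tolerance levels of expert scores."""
    hits = sum(1 for p, g in zip(pred, gold) if abs(p - g) <= tolerance)
    return hits / len(pred)

def degree_of_variation(pred, gold):
    """Mean absolute deviation, in score levels, from expert ratings."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)

# Hypothetical model vs. expert scores on a 1-5 assessment scale
model_scores = [3.0, 3.5, 4.0, 2.5]
expert_scores = [3.0, 4.0, 3.0, 2.5]
print(acceptable_accuracy(model_scores, expert_scores))  # 0.75
print(degree_of_variation(model_scores, expert_scores))  # 0.375
```

The example scores and the 1-5 scale are illustrative only, not from the paper's evaluation data.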