Relying on human experts to evaluate CEFR speaking assessments in an e-learning environment creates scalability challenges, as it limits how quickly and widely assessments can be conducted. We aim to automate the evaluation of CEFR B2 English speaking assessments in e-learning environments from conversation transcripts. First, we evaluate the capability of leading open-source and commercial Large Language Models (LLMs) to score a candidate's performance across the various criteria of the CEFR B2 speaking exam in both global and India-specific contexts. Next, we create a new expert-validated, CEFR-aligned synthetic conversational dataset with transcripts rated at different assessment scores. In addition, new instruction-tuned datasets are developed from the English Vocabulary Profile (up to CEFR B2 level) and the CEFR-SP WikiAuto datasets. Finally, using these new datasets, we perform parameter-efficient instruction tuning of Mistral Instruct 7B v0.2 to develop a family of models called EvalYaks. Four models in this family assess the four sections of the CEFR B2 speaking exam, one identifies the CEFR level of vocabulary and generates level-specific vocabulary, and another detects the CEFR level of text and generates level-specific text. EvalYaks achieved an average acceptable accuracy of 96%, a degree of variation of 0.35 levels, and performed three times better than the next best model. This demonstrates that a 7B-parameter LLM instruction-tuned with high-quality CEFR-aligned assessment data can effectively evaluate and score CEFR B2 English speaking assessments, offering a promising solution for scalable, automated language proficiency evaluation.
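The two headline metrics above can be illustrated with a minimal sketch. Assuming "acceptable accuracy" counts predictions within a tolerance (here ±0.5 levels, an assumption) of the expert score, and "degree of variation" is the mean absolute deviation from expert ratings; the paper's exact definitions may differ:

```python
def acceptable_accuracy(pred, gold, tolerance=0.5):
    """Fraction of model scores within +/-tolerance levels of expert scores."""
    hits = sum(1 for p, g in zip(pred, gold) if abs(p - g) <= tolerance)
    return hits / len(pred)

def degree_of_variation(pred, gold):
    """Mean absolute deviation, in score levels, from expert ratings."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)

# Hypothetical model vs. expert scores on a 1-5 assessment scale
model_scores = [3.0, 3.5, 4.0, 2.5]
expert_scores = [3.0, 4.0, 3.0, 2.5]
print(acceptable_accuracy(model_scores, expert_scores))  # 0.75
print(degree_of_variation(model_scores, expert_scores))  # 0.375
```

The example scores and the 1-5 scale are illustrative only, not from the paper's evaluation data.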