On the application of Large Language Models for language teaching and assessment technology

Andrew Caines, Luca Benedetto, Shiva Taslimipoor, Christopher Davis, Yuan Gao, Oeistein Andersen, Zheng Yuan, Mark Elliott, Russell Moore, Christopher Bryant, Marek Rei, Helen Yannakoudakis, Andrew Mullooly, Diane Nicholls, Paula Buttery

From arXiv. Accepted at the AIED2023 workshop: Empowering Education with LLMs - the Next-Gen Interface and Content Generation

The recent release of very large language models such as PaLM and GPT-4 has made an unprecedented impact in the popular media and public consciousness, giving rise to a mixture of excitement and fear as to their capabilities and potential uses, and shining a light on natural language processing research which had not previously received so much attention. The developments offer great promise for education technology, and in this paper we look specifically at the potential for incorporating large language models in AI-driven language teaching and assessment systems. We consider several research areas and also discuss the risks and ethical considerations surrounding generative AI in education technology for language learners. Overall we find that larger language models offer improvements over previous models in text generation, opening up routes toward content generation which had not previously been plausible. For text generation they must be prompted carefully and their outputs may need to be reshaped before they are ready for use. For automated grading and grammatical error correction, tasks whose progress is checked on well-known benchmarks, early investigations indicate that large language models on their own do not improve on state-of-the-art results according to standard evaluation metrics. For grading it appears that linguistic features established in the literature should still be used for best performance, and for error correction it may be that the models can offer alternative feedback styles which are not measured sensitively with existing methods. In all cases, there is work to be done to experiment with the inclusion of large language models in education technology for language learners, in order to properly understand and report on their capacities and limitations, and to ensure that foreseeable risks such as misinformation and harmful bias are mitigated.
