Large Language Models (LLMs) are increasingly used in educational and learning applications. Research has shown that controlling style to fit the needs of the learner fosters understanding, promotes inclusion, and aids knowledge distillation. To understand the capabilities and limitations of contemporary LLMs in style control, we evaluated five state-of-the-art models (GPT-3.5, GPT-4, GPT-4o, Llama-3, and Mistral-instruct-7B) across two style-control tasks. In the first task, we observed significant inconsistencies: for content intended for first-graders, model outputs averaged between 5th- and 8th-grade reading levels, with standard deviations as high as 27.6. In the second task, we observed a statistically significant improvement in performance, from 0.02 to 0.26. However, we find that even when reference texts contained no stereotypes, the LLMs often generated culturally insensitive content. We provide a thorough analysis and discussion of these results.