Assessing soft skills such as empathy, ethical judgment, and communication is essential in competitive selection processes, yet human scoring is often inconsistent and biased. While Large Language Models (LLMs) have improved Automated Essay Scoring (AES), we show that state-of-the-art rationale-based fine-tuning methods struggle with the abstract, context-dependent nature of Multiple Mini-Interviews (MMIs), missing the implicit signals embedded in candidate narratives. We introduce a multi-agent prompting framework that decomposes the evaluation process into transcript refinement and criterion-specific scoring. Using 3-shot in-context learning with a large instruct-tuned model, our approach outperforms specialised fine-tuned baselines (average QWK 0.62 vs. 0.32) and achieves reliability comparable to that of human experts. We further demonstrate the generalisability of our framework on the ASAP benchmark, where it rivals domain-specific state-of-the-art models without additional training. These findings suggest that, for complex, subjective reasoning tasks, structured prompt engineering may offer a scalable alternative to data-intensive fine-tuning, reshaping how LLMs can be applied to automated assessment.