This study explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance objectivity in organizational task performance evaluations. Through comparative analyses across two studies spanning a variety of task performance outputs, we demonstrate that LLMs can serve as a reliable, and in some respects superior, alternative to human raters in evaluating knowledge-based performance outputs, a key contribution of knowledge workers. Our results suggest that GPT ratings are comparable to human ratings while exhibiting higher consistency and reliability. Additionally, combining multiple GPT ratings of the same performance output yields strong correlations with aggregated human performance ratings, mirroring the consensus principle observed in the performance evaluation literature. However, we also find that LLMs are prone to contextual biases, such as the halo effect, echoing well-documented human evaluative biases. Our research suggests that while LLMs are capable of extracting meaningful constructs from text-based data, their scope is currently limited to specific forms of performance evaluation. By highlighting both the potential and the limitations of LLMs, our study contributes to the discourse on AI's role in management studies and lays a foundation for future research to refine the theoretical and practical applications of AI in management.