Navigating complex social situations is an integral part of corporate life, ranging from giving critical feedback without hurting morale to rejecting requests without alienating teammates. Although large language models (LLMs) are permeating the workplace, it is unclear how well they can navigate these norms. To investigate this question, we created HR Simulator, a game where users roleplay as an HR officer and write emails to tackle challenging workplace scenarios, evaluated with GPT-4o as a judge based on scenario-specific rubrics. We analyze over 600 human and LLM emails and find systematic differences in style: LLM emails are more formal and empathetic. Furthermore, humans underperform LLMs (e.g., 23.5% vs. 48-54% scenario pass rate), but human emails rewritten by LLMs can outperform both, which indicates a hybrid advantage. On the evaluation side, judges can exhibit differences in their email preferences: an analysis of 10 judge models reveals evidence for emergent tact, where weaker models prefer direct, blunt communication but stronger models prefer more subtle messages. Judges also agree with each other more as they scale, which hints at a convergence toward shared communicative norms that may differ from humans'. Overall, our results suggest LLMs could substantially reshape communication in the workplace if they are widely adopted in professional correspondence.