Conversational moderation of online communities is crucial to maintaining civility for a constructive environment, but it is challenging to scale and harmful to moderators. The inclusion of sophisticated natural language generation modules as a force multiplier to aid human moderators is a tantalizing prospect, but adequate evaluation approaches have so far been elusive. In this paper, we establish a systematic definition of conversational moderation effectiveness grounded in moderation literature and establish design criteria for conducting realistic yet safe evaluation. We then propose a comprehensive evaluation framework to assess models' moderation capabilities independently of human intervention. With our framework, we conduct the first known study of language models as conversational moderators, finding that appropriately prompted models that incorporate insights from social science can provide specific and fair feedback on toxic behavior but struggle to influence users to increase their levels of respect and cooperation.