Online content moderation is essential for maintaining a healthy digital environment, and reliance on AI for this task continues to grow. Consider a user comment that uses national stereotypes to insult a politician. This example illustrates two critical challenges in real-world scenarios: (1) Co-occurring Violations, where a single post violates multiple policies (e.g., prejudice and personal attacks); (2) Dynamic Moderation Rules, where the determination of a violation depends on platform-specific guidelines that evolve across contexts. The intersection of co-occurring harms and dynamically changing rules highlights a core limitation of current AI systems: although large language models (LLMs) are adept at following fixed guidelines, their judgment degrades when policies are unstable or context-dependent. In practice, such shortcomings lead to inconsistent moderation: either erroneously restricting legitimate expression or allowing harmful content to remain online. This raises a critical question for evaluation: Does high performance on existing static benchmarks truly guarantee robust generalization of AI judgment to real-world scenarios involving co-occurring violations and dynamically changing rules?