Hate moderation is often evaluated as classification on clean English inputs, but deployed systems must route content to actions such as ALLOW, FLAG, or REVIEW. We study how this workflow changes under code-mixed inputs using a paired evaluation setting where the same underlying content is expressed as clean English and Tamil-English code-mix. Under thresholds tuned on clean English development data, code-mixed inputs produce substantial action instability, with a paired clean-to-code-mix decision flip rate of 0.265. The main workflow effects are increased review burden and increased false-flagging of non-hateful content: review rate rises from 0.138 to 0.297 and non-hate false-flag rate rises from 0.069 to 0.104. Tamil-only inputs show stronger degradation overall, suggesting a broader language-coverage limitation rather than the same code-mixed instability pattern. A simple disagreement-based deferral rule reduces automatic errors on stressed inputs, but only by increasing review load. These results show that workflow-level evaluation reveals moderation failures that standard classification summaries can miss.
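To make the workflow-level quantities concrete, the following is a minimal sketch of how such metrics could be computed from paired action decisions. The abstract does not give exact definitions, so everything here is an assumption for illustration: the two-threshold routing rule and its values (`t_low`, `t_high`), the metric formulas, the function names, and the reading of "disagreement-based deferral" as sending an item to REVIEW whenever two decision sources disagree.

```python
from typing import List, Sequence

ALLOW, FLAG, REVIEW = "ALLOW", "FLAG", "REVIEW"


def route(score: float, t_low: float = 0.3, t_high: float = 0.7) -> str:
    """Map a hate score in [0, 1] to an action.

    Assumed rule: high scores are flagged, low scores are allowed,
    and the uncertain middle band goes to human review.
    The threshold values are illustrative, not taken from the paper.
    """
    if score >= t_high:
        return FLAG
    if score < t_low:
        return ALLOW
    return REVIEW


def flip_rate(clean: Sequence[str], mixed: Sequence[str]) -> float:
    """Fraction of paired items whose action differs between the two variants."""
    if len(clean) != len(mixed) or not clean:
        raise ValueError("inputs must be non-empty and paired one-to-one")
    return sum(a != b for a, b in zip(clean, mixed)) / len(clean)


def review_rate(actions: Sequence[str]) -> float:
    """Fraction of items routed to REVIEW."""
    return sum(a == REVIEW for a in actions) / len(actions)


def non_hate_false_flag_rate(actions: Sequence[str], is_hate: Sequence[bool]) -> float:
    """Fraction of non-hateful items that were automatically flagged."""
    non_hate = [a for a, h in zip(actions, is_hate) if not h]
    return sum(a == FLAG for a in non_hate) / len(non_hate)


def defer_on_disagreement(primary: Sequence[str], secondary: Sequence[str]) -> List[str]:
    """Keep the primary action when both sources agree; otherwise send to REVIEW."""
    return [a if a == b else REVIEW for a, b in zip(primary, secondary)]


# Toy paired scores (made up for illustration, not the paper's data).
clean_scores = [0.1, 0.8, 0.2, 0.5]
mixed_scores = [0.4, 0.8, 0.75, 0.5]
is_hate = [False, True, False, False]

clean_actions = [route(s) for s in clean_scores]
mixed_actions = [route(s) for s in mixed_scores]

print("flip rate:", flip_rate(clean_actions, mixed_actions))
print("review rate (clean, mixed):",
      review_rate(clean_actions), review_rate(mixed_actions))
print("non-hate false-flag rate (mixed):",
      non_hate_false_flag_rate(mixed_actions, is_hate))
```

Under these assumed definitions, deferral can only move items from ALLOW or FLAG into REVIEW, which is consistent with the trade-off described above: fewer automatic errors at the cost of a higher review load.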