There has been little systematic study of how dialectal differences affect toxicity detection by modern LLMs. Furthermore, although using LLMs as evaluators ("LLM-as-a-judge") is a growing research area, their sensitivity to dialectal nuances remains underexplored and requires more focused attention. In this paper, we address these gaps through a comprehensive toxicity evaluation of LLMs across diverse dialects. We create a multi-dialect dataset through synthetic transformations and human-assisted translations, covering 10 language clusters and 60 varieties. We then evaluate three LLMs on their ability to assess toxicity, measuring multilingual consistency, dialectal consistency, and LLM-human agreement. Our findings show that LLMs are sensitive to both multilingual and dialectal variation. However, when these consistency dimensions are ranked, LLM-human agreement is the weakest, followed by dialectal consistency. Code repository: \url{https://github.com/ffaisal93/dialect_toxicity_llm_judge}