Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators, especially in low-resource and identity-sensitive settings, remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality: smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.