As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, which are spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of each prompt. Our analysis reveals significant safety drift: cross-language agreement is just 12.8\%, and \texttt{SAFE} rate variance exceeds 17\% across languages. Some models over-refuse benign prompts in low-resource scripts or overflag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release \textsc{IndicSafe}, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.
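The abstract names prompt-level entropy and cross-language agreement but does not define them. The following is a minimal sketch of one plausible reading, not the authors' implementation: entropy is taken as the Shannon entropy of the safety labels one prompt receives across its language variants, and agreement as the fraction of prompts whose label is identical in every language. The function names, the toy data, the language codes, and the label \texttt{UNSAFE} are assumptions made for illustration; only \texttt{SAFE} appears in the text above.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the labels given to one prompt
    across its language variants. 0.0 means every language agreed."""
    counts = Counter(labels)
    total = len(labels)
    # p * log2(1/p), written as (c/total) * log2(total/c)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def full_agreement_rate(labels_by_prompt):
    """Fraction of prompts that receive the same label in every language
    (an assumed definition of cross-language agreement)."""
    agree = sum(1 for labels in labels_by_prompt.values() if len(set(labels)) == 1)
    return agree / len(labels_by_prompt)


def safe_rate_by_language(labels_by_prompt, languages):
    """Per-language share of prompts labelled SAFE. Each label list is
    assumed to be ordered like `languages`."""
    n = len(labels_by_prompt)
    return {
        lang: sum(1 for labels in labels_by_prompt.values() if labels[i] == "SAFE") / n
        for i, lang in enumerate(languages)
    }


# Hypothetical toy data: two prompts, four language variants each.
languages = ["hi", "bn", "ta", "te"]
toy = {
    "prompt_1": ["SAFE", "SAFE", "SAFE", "SAFE"],
    "prompt_2": ["SAFE", "SAFE", "UNSAFE", "UNSAFE"],
}

for prompt_id, labels in toy.items():
    print(prompt_id, label_entropy(labels))
print("agreement:", full_agreement_rate(toy))
print("SAFE rate:", safe_rate_by_language(toy, languages))
```

Under these assumed definitions, a prompt labelled identically in all languages has entropy 0, while an even split between two labels has entropy 1 bit. The agreement rate counts only the prompts with zero entropy.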