As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, which are spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of each prompt. Our analysis reveals significant safety drift: cross-language agreement is just 12.8\%, and \texttt{SAFE} rate variance exceeds 17\% across languages. Some models over-refuse benign prompts in low-resource scripts or overflag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release \textsc{IndicSafe}, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.
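The abstract names prompt-level entropy and cross-language agreement without defining them, so the following is only an illustrative sketch of how such quantities could be computed, not the paper's implementation. It assumes each prompt receives one categorical safety label per language. The label `UNSAFE`, the language codes, the toy data, and all function names are hypothetical; only the `SAFE` label appears in the text above.

```python
import math
from collections import Counter


def prompt_entropy(labels):
    """Shannon entropy (in bits) of the labels one prompt receives across languages."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def cross_language_agreement(table):
    """Fraction of prompts whose label is identical in every language."""
    unanimous = sum(1 for labels in table.values() if len(set(labels.values())) == 1)
    return unanimous / len(table)


def safe_rate_by_language(table):
    """Share of prompts labeled SAFE, computed separately for each language."""
    totals, safe = Counter(), Counter()
    for labels in table.values():
        for lang, label in labels.items():
            totals[lang] += 1
            if label == "SAFE":
                safe[lang] += 1
    return {lang: safe[lang] / totals[lang] for lang in totals}


# Toy data (hypothetical): prompt id -> {language code: safety label}.
toy = {
    "p1": {"hi": "SAFE", "ta": "SAFE", "bn": "SAFE"},
    "p2": {"hi": "SAFE", "ta": "UNSAFE", "bn": "SAFE"},
    "p3": {"hi": "UNSAFE", "ta": "UNSAFE", "bn": "SAFE"},
    "p4": {"hi": "SAFE", "ta": "UNSAFE", "bn": "UNSAFE"},
}

agreement = cross_language_agreement(toy)
rates = safe_rate_by_language(toy)
spread = max(rates.values()) - min(rates.values())
entropies = {pid: prompt_entropy(list(labels.values())) for pid, labels in toy.items()}
```

Under these assumed definitions, an entropy of zero means a prompt is labeled identically in every language, and higher values indicate that translation alone changes the safety verdict. The spread of per-language `SAFE` rates is a simple stand-in for the variance figure reported above.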