Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions such as India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English-centric representation spaces, making cross-lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which decodes intermediate activations directly, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distribution to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly outperforms state-of-the-art interpretability methods, especially for morphologically rich, low-resource languages. Our results provide crucial insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/MihirRajeshPanchal/IndicTunedLens. Our code is available at https://github.com/MihirRajeshPanchal/IndicTunedLens.
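To make the core idea concrete, the sketch below shows a minimal tuned-lens-style probe: a learned affine map applied to an intermediate hidden state before projecting through the unembedding matrix, so that mid-layer activations are translated into the basis the final layer expects. This is an illustrative assumption of the general technique, not the authors' implementation; the class name `AffineLens` and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class AffineLens(nn.Module):
    """Minimal sketch of a tuned-lens-style affine probe (hypothetical).

    A learned affine translator aligns an intermediate hidden state with
    the model's final-layer representation space; the (frozen) unembedding
    then maps it to vocabulary logits.
    """

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        # Affine translator: initialized to the identity so that an
        # untrained lens reduces to the standard Logit Lens.
        self.translator = nn.Linear(d_model, d_model)
        nn.init.eye_(self.translator.weight)
        nn.init.zeros_(self.translator.bias)
        # Unembedding projection; in practice this would be tied to the
        # base model's output matrix and kept frozen during training.
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)
        self.unembed.weight.requires_grad_(False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Translate the intermediate activation, then decode to logits.
        return self.unembed(self.translator(hidden))

# Toy usage with made-up sizes: batch of 2, sequence length 5.
d_model, vocab = 16, 100
lens = AffineLens(d_model, vocab)
hidden = torch.randn(2, 5, d_model)   # (batch, seq, d_model)
logits = lens(hidden)
print(tuple(logits.shape))            # (2, 5, 100)
```

Only the translator's weights would be trained (e.g. by minimizing KL divergence between the lens output and the model's final distribution); a per-language or shared translator, as the abstract describes, follows the same pattern.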