Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions such as India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English-centric representation spaces, making cross-lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which decodes intermediate activations directly, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distribution to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly outperforms state-of-the-art interpretability methods, especially for morphologically rich, low-resource languages. Our results provide crucial insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/MihirRajeshPanchal/IndicTunedLens. Our code is available at https://github.com/MihirRajeshPanchal/IndicTunedLens.
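To make the core idea concrete, the sketch below shows a minimal tuned-lens-style probe: a learned affine map applied to an intermediate hidden state before projecting through the unembedding matrix, so that mid-layer activations are translated into the basis the final layer expects. This is an illustrative assumption of the general technique, not the authors' implementation; the class name `AffineLens` and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class AffineLens(nn.Module):
    """Minimal sketch of a tuned-lens-style affine probe (hypothetical).

    A learned affine translator aligns an intermediate hidden state with
    the model's final-layer representation space; the (frozen) unembedding
    then maps it to vocabulary logits.
    """

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        # Affine translator: initialized to the identity so that an
        # untrained lens reduces to the standard Logit Lens.
        self.translator = nn.Linear(d_model, d_model)
        nn.init.eye_(self.translator.weight)
        nn.init.zeros_(self.translator.bias)
        # Unembedding projection; in practice this would be tied to the
        # base model's output matrix and kept frozen during training.
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)
        self.unembed.weight.requires_grad_(False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Translate the intermediate activation, then decode to logits.
        return self.unembed(self.translator(hidden))

# Toy usage with made-up sizes: batch of 2, sequence length 5.
d_model, vocab = 16, 100
lens = AffineLens(d_model, vocab)
hidden = torch.randn(2, 5, d_model)   # (batch, seq, d_model)
logits = lens(hidden)
print(tuple(logits.shape))            # (2, 5, 100)
```

Only the translator's weights would be trained (e.g. by minimizing KL divergence between the lens output and the model's final distribution); a per-language or shared translator, as the abstract describes, follows the same pattern.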