The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual complexity. This paper introduces IndicEval, a scalable benchmarking platform designed to assess LLM performance using authentic high-stakes examination questions from UPSC, JEE, and NEET across STEM and humanities domains in both English and Hindi. Unlike synthetic benchmarks, IndicEval grounds evaluation in real examination standards, enabling realistic measurement of reasoning, domain knowledge, and bilingual adaptability. The framework automates assessment using Zero-Shot, Few-Shot, and Chain-of-Thought (CoT) prompting strategies and supports modular integration of new models and languages. Experiments conducted on Gemini 2.0 Flash, GPT-4, Claude, and LLaMA 3-70B reveal three major findings. First, CoT prompting consistently improves reasoning accuracy, with substantial gains across subjects and languages. Second, significant cross-model performance disparities persist, particularly in high-complexity examinations. Third, multilingual degradation remains a critical challenge, with marked accuracy drops in Hindi compared to English, especially under Zero-Shot conditions. These results highlight persistent gaps in bilingual reasoning and domain transfer. Overall, IndicEval provides a practice-oriented, extensible foundation for rigorous, equitable evaluation of LLMs in multilingual educational settings and offers actionable insights for improving reasoning robustness and language adaptability.
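The three prompting strategies the framework automates can be sketched as simple prompt templates. This is a minimal illustrative sketch, not the paper's actual implementation; all function names, demonstration data, and template wording here are assumptions.

```python
# Hypothetical sketch of the three prompting strategies IndicEval automates
# (Zero-Shot, Few-Shot, Chain-of-Thought). Names and templates are illustrative.

# A tiny stand-in pool of worked question/answer demonstrations for few-shot.
FEW_SHOT_EXAMPLES = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of India?", "answer": "New Delhi"},
]

def build_prompt(question: str, strategy: str = "zero_shot") -> str:
    """Assemble an evaluation prompt for one exam question under one strategy."""
    if strategy == "zero_shot":
        # Question alone; the model answers with no demonstrations.
        return f"Question: {question}\nAnswer:"
    if strategy == "few_shot":
        # Prepend worked question/answer pairs as in-context demonstrations.
        demos = "\n".join(
            f"Question: {ex['question']}\nAnswer: {ex['answer']}"
            for ex in FEW_SHOT_EXAMPLES
        )
        return f"{demos}\nQuestion: {question}\nAnswer:"
    if strategy == "cot":
        # Ask the model to reason step by step before committing to an answer.
        return (
            f"Question: {question}\n"
            "Let's think step by step, then state the final answer."
        )
    raise ValueError(f"unknown strategy: {strategy!r}")
```

In a harness like this, the same question bank can be run under all three strategies and both languages, so accuracy differences (e.g. the CoT gains and Hindi Zero-Shot drops reported above) are attributable to the prompting condition rather than to the question set.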