We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and its specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In its construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs' generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides a comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models (LRMs). The results demonstrate the significant necessity of evaluation and model optimization tailored to the Hungarian language and its specifics. We also established a framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at https://github.com/opendatalab/OpenHuEval.