We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and its specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In its construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs' generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides a comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models (LRMs). The results demonstrate the significant necessity of evaluation and model optimization tailored to the Hungarian language and its specifics. We also established a framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at https://github.com/opendatalab/OpenHuEval.