With the recent development of large language models (LLMs), the need for models specialized in particular domains and languages has been discussed. There is also a growing need for benchmarks that evaluate the performance of current LLMs in each domain. In this study, we therefore constructed a benchmark comprising multiple tasks specific to the Japanese language and the financial domain, and evaluated several models on it. The results confirm that GPT-4 currently stands out and that the constructed benchmark functions effectively. Our analysis indicates that, by combining tasks of differing difficulty, the benchmark can differentiate models' scores across all performance ranges.