Evaluating large language models is an essential task in language understanding and generation. As language models continue to advance, effective benchmarks for assessing their performance have become imperative. For Traditional Chinese, comprehensive and diverse benchmarks remain scarce, despite the existence of datasets such as DRCD, TTQA, CMDQA, and FGC. To address this gap, we propose a novel set of benchmarks that build on existing English datasets and are tailored to evaluating language models in Traditional Chinese. These benchmarks cover a wide range of tasks, including contextual question answering, summarization, classification, and table understanding, and together provide a comprehensive framework for assessing language models' capabilities across tasks. In this paper, we evaluate GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks. The results show that Model 7-C achieves performance comparable to GPT-3.5 on a subset of the evaluated capabilities. To advance the evaluation of language models in Traditional Chinese and stimulate further research in this field, we have open-sourced our benchmarks and made the model available for trial.