Large language models (LLMs) have demonstrated impressive performance on a variety of natural language processing (NLP) tasks. However, there is limited understanding of how well LLMs perform in specific domains (e.g., the intellectual property (IP) domain). In this paper, we contribute a new benchmark, the first Multilingual-oriented quiZ on Intellectual Property (MoZIP), for evaluating LLMs in the IP domain. The MoZIP benchmark includes three challenging tasks: IP multiple-choice quiz (IPQuiz), IP question answering (IPQA), and patent matching (PatentMatch). In addition, we develop a new IP-oriented multilingual large language model (called MoZi), a BLOOMZ-based model obtained by supervised fine-tuning on multilingual IP-related text data. We evaluate our proposed MoZi model and four well-known LLMs (i.e., BLOOMZ, BELLE, ChatGLM, and ChatGPT) on the MoZIP benchmark. Experimental results demonstrate that MoZi outperforms BLOOMZ, BELLE, and ChatGLM by a noticeable margin, while scoring lower than ChatGPT. Notably, the performance of current LLMs on the MoZIP benchmark leaves much room for improvement, and even the most powerful model, ChatGPT, does not reach the passing level. Our source code, data, and models are available at \url{https://github.com/AI-for-Science/MoZi}.