Large language models (LLMs) have shown remarkable capabilities in generating high-quality text and making predictions based on large amounts of data, including in the media domain. In practical applications, however, the differences between media use cases and the general-purpose applications of LLMs have become increasingly apparent, especially in the Chinese media domain. This paper examines the characteristics that distinguish media-domain-specific LLMs from general LLMs, designs a diverse set of task instruction types to cater to the specific requirements of the domain, and constructs datasets tailored to the media domain. Building on these, we propose MediaGPT, a domain-specific LLM for the Chinese media domain, trained on domain-specific data and expert supervised fine-tuning (SFT) data. Through human expert evaluation and strong-model evaluation on a validation set, we demonstrate that MediaGPT outperforms mainstream models on various Chinese media domain tasks, and we verify the importance of domain data and domain-defined prompt types for building an effective domain-specific LLM.