With the rapid growth of Natural Language Processing (NLP), a wide variety of Large Language Models (LLMs) continue to emerge for diverse NLP tasks. As the number of published papers increases, researchers and developers face information overload. It is therefore important to develop a system that automatically extracts and organises key information about LLMs from academic papers (an \textbf{LLM model card}). This work develops such a pioneering system using Named Entity Recognition (\textbf{NER}) and Relation Extraction (\textbf{RE}) methods to automatically extract key information about large language models from papers, helping researchers access information about LLMs efficiently. The extracted features include model \textit{name}, model \textit{licence}, and model \textit{application}; together, these features form a model card for each paper. As a \textbf{data contribution}, 106 academic papers were processed by defining three dictionaries: LLM names, licences, and applications. 11,051 sentences were extracted through dictionary lookup, and the dataset was constructed through manual review, yielding a final selection of 129 sentences linking a model name to a licence and 106 sentences linking a model name to an application.
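The dictionary-lookup step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary entries below are hypothetical placeholders standing in for the paper's actual LLM-name, licence, and application dictionaries, and the matching logic (case-insensitive substring search) is an assumption.

```python
import re

# Hypothetical mini-dictionaries; the paper's actual dictionaries cover
# LLM names, licences, and applications drawn from 106 papers.
DICTIONARIES = {
    "model_name": {"LLaMA", "GPT-4", "BLOOM"},
    "licence": {"Apache-2.0", "MIT", "non-commercial"},
    "application": {"summarisation", "question answering"},
}

def tag_sentence(sentence):
    """Return the dictionary categories whose terms appear in the sentence."""
    hits = set()
    for category, terms in DICTIONARIES.items():
        for term in terms:
            if re.search(re.escape(term), sentence, flags=re.IGNORECASE):
                hits.add(category)
    return hits

def candidate_pairs(sentences):
    """Keep sentences mentioning a model name together with a licence or an
    application -- candidates for subsequent manual relation annotation."""
    name_licence, name_application = [], []
    for s in sentences:
        hits = tag_sentence(s)
        if "model_name" in hits and "licence" in hits:
            name_licence.append(s)
        if "model_name" in hits and "application" in hits:
            name_application.append(s)
    return name_licence, name_application
```

In the paper's pipeline, the sentences surviving this co-occurrence filter were then manually reviewed to confirm an actual relation between the entities, producing the 129 name-licence and 106 name-application examples.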