The rapid proliferation of AI models has underscored the importance of thorough documentation, which enables users to understand, trust, and effectively use these models in various applications. Although developers are encouraged to produce model cards, it is unclear how much information these cards contain or what kind. In this study, we conduct a comprehensive analysis of the documentation for 32,111 AI models on Hugging Face, a leading platform for distributing and deploying AI models. Our investigation sheds light on prevailing model card documentation practices. Most AI models with substantial downloads provide model cards, though the cards vary widely in informativeness. We find that the sections addressing environmental impact, limitations, and evaluation have the lowest fill-out rates, while the training section is the most consistently filled out. We analyze the content of each section to characterize practitioners' priorities. Interestingly, the cards discuss data at substantial length, sometimes with equal or even greater emphasis than the model itself. To evaluate the impact of model cards, we conduct an intervention study, adding detailed model cards to 42 popular models that previously had no model card or only a sparse one. We find that adding model cards is moderately correlated with an increase in weekly download rates. Our study opens up a new perspective for analyzing community norms and practices for model documentation through large-scale data science and linguistic analysis.