With the massive surge in ML models on platforms like Hugging Face, users often struggle to navigate the available options and choose the best model for their downstream tasks, frequently relying on popularity signals such as download counts, likes, or recency. We investigate whether this popularity aligns with actual model performance, and how the comprehensiveness of model documentation correlates with both popularity and performance. In our study, we evaluated a comprehensive set of 500 Sentiment Analysis models on Hugging Face. This evaluation involved a massive annotation effort, with human annotators completing nearly 80,000 annotations, alongside extensive model training and evaluation. Our findings reveal that model popularity does not necessarily correlate with performance. Additionally, we identify critical inconsistencies in model card reporting: approximately 80\% of the models analyzed lack detailed information about the model, training, and evaluation processes. Furthermore, about 88\% of model authors overstate their models' performance in the model cards. Based on our findings, we provide a checklist of guidelines to help users choose good models for downstream tasks.