Towards Imputation of Pre-Trained Language Model Metadata using Semantic Fingerprinting

Pre-trained language models (PTLMs) hosted on platforms such as Hugging Face form complex lineage structures similar to software dependency graphs. However, unlike traditional software ecosystems, PTLM repositories often lack reliable provenance due to missing metadata, such as licenses, reuse methods, pipeline tags, model types, and training libraries. To address this gap, we introduce Semantic Fingerprinting (SemFin), a lightweight approach that combines Hugging Face (HF) configuration files with model repository tags to automatically impute missing model metadata fields and reconstruct model lineage chains. We evaluate SemFin on a large-scale dataset of 317,133 PTLMs. Our results show that configuration files typically encode the technical requirements necessary to instantiate and reuse models, enabling them to serve as a structural blueprint for model reuse, particularly for transformer-based architectures. By combining these configuration files with model repository tags, SemFin significantly outperforms the existing propagation-based imputation approaches, improving prediction accuracy by up to 31.4% and 26.6% compared to Graph Avg and Hub Avg baselines. Importantly, SemFin also imputes metadata for 16.6% of isolated models where propagation-based methods fail. Applying SemFin to impute missing reuse-method and license metadata for 167,089 unlabeled models reveals that traceable reuse method chains expand by 75.9% and license lineage chains by 53.6%, uncovering 86 previously invisible reuse method patterns, while the proportion of incompatible license patterns only increases from 34.8% to 36.8%. These findings demonstrate how automatically derived structural signals can support the automated construction of AI Bills of Materials (AIBOMs), helping transform metadata from an error-prone manual declaration into information inferred directly from model artifacts.

翻译：暂无翻译

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【NTU博士论文】针对预训练语言模型的数据高效领域适应,150页pdf

专知会员服务

50+阅读 · 2023年5月24日

【清华大学】Delta调优:预训练语言模型参数有效方法的综合研究，Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models

专知会员服务

26+阅读 · 2022年3月15日