Pre-trained language models (PTLMs) hosted on platforms such as Hugging Face form complex lineage structures similar to software dependency graphs. However, unlike traditional software ecosystems, PTLM repositories often lack reliable provenance due to missing metadata, such as licenses, reuse methods, pipeline tags, model types, and training libraries. To address this gap, we introduce Semantic Fingerprinting (SemFin), a lightweight approach that combines Hugging Face (HF) configuration files with model repository tags to automatically impute missing model metadata fields and reconstruct model lineage chains. We evaluate SemFin on a large-scale dataset of 317,133 PTLMs. Our results show that configuration files typically encode the technical requirements necessary to instantiate and reuse models, enabling them to serve as a structural blueprint for model reuse, particularly for transformer-based architectures. By combining these configuration files with model repository tags, SemFin significantly outperforms the existing propagation-based imputation approaches, improving prediction accuracy by up to 31.4% and 26.6% compared to Graph Avg and Hub Avg baselines. Importantly, SemFin also imputes metadata for 16.6% of isolated models where propagation-based methods fail. Applying SemFin to impute missing reuse-method and license metadata for 167,089 unlabeled models reveals that traceable reuse method chains expand by 75.9% and license lineage chains by 53.6%, uncovering 86 previously invisible reuse method patterns, while the proportion of incompatible license patterns only increases from 34.8% to 36.8%. These findings demonstrate how automatically derived structural signals can support the automated construction of AI Bills of Materials (AIBOMs), helping transform metadata from an error-prone manual declaration into information inferred directly from model artifacts.
翻译:暂无翻译