On the synchronization between Hugging Face pre-trained language models and their upstream GitHub repository

Pre-trained language models (PTLMs) have transformed natural language processing (NLP), enabling major advances in tasks such as text generation and translation. Similar to software package management, PTLMs are developed using code and environment scripts hosted in upstream repositories (e.g., GitHub), while families of trained model variants are distributed through downstream platforms such as Hugging Face (HF). Despite this similarity, coordinating development and release activities across these platforms remains challenging, leading to misaligned timelines, inconsistent versioning practices, and barriers to effective reuse. To examine how commit activities are coordinated between GitHub and HF, we conducted an in-depth mixed-method study of 325 PTLM families comprising 904 HF model variants. Our findings show that GitHub contributors primarily focus on model version specification, code quality improvements, performance optimization, and dependency management, whereas HF contributors mainly address model documentation, dataset handling, and inference setup. We further analyze synchronization across three dimensions -- lag, type, and intensity -- revealing eight distinct synchronization patterns. The dominance of partially synchronized patterns, such as Disperse and Sparse synchronization, highlights structural disconnects in cross-platform release practices. These disconnects often result in isolated or abandoned updates, increasing the risk of incomplete, outdated, or behaviorally inconsistent models being exposed to end users. Recognizing these synchronization patterns is essential for improving oversight and traceability in PTLM release workflows.

