DRAGON: Robust Classification for Very Large Collections of Software Repositories

Automatically classifying source code repositories with "topics" that reflect their content and purpose is very useful, especially when navigating or searching large software collections. However, existing approaches often rely heavily on README files and other metadata, which are frequently missing; this limits their applicability in real-world, large-scale settings. We present DRAGON, a repository classifier designed for very large and diverse software collections. It operates entirely on lightweight signals commonly stored in version control systems: file and directory names, and optionally the README when available. In repository classification at scale, DRAGON improves F1@5 from 54.8% to 60.8%, surpassing the state of the art. DRAGON remains effective even when README files are absent, with performance degrading by only 6% compared to when they are present. This robustness makes it practical for real-world settings where documentation is sparse or inconsistent. Furthermore, many of the remaining classification errors are near misses, where predicted labels are semantically close to the correct topics. This property increases the practical value of the predictions in real-world software collections, where suggesting a few related topics can still guide search and discovery. As a byproduct of developing DRAGON, we also release the largest open dataset to date for repository classification, consisting of 825 thousand repositories with associated ground-truth topics sourced from the Software Heritage archive. The dataset provides a foundation for future large-scale and language-agnostic research on software repository understanding.
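The abstract reports F1@5 without defining it. As a minimal sketch, assuming one common definition (per-repository precision and recall computed over the top-k predicted topics), the metric could be computed as below. The function name `f1_at_k` and the example topics are hypothetical, and the evaluation protocol actually used for DRAGON may differ, for instance in how scores are averaged across repositories.

```python
def f1_at_k(ranked_topics, true_topics, k=5):
    """F1@k for a single repository.

    Assumption: precision and recall are computed over the k highest-ranked
    predicted topics. This is one common definition, not necessarily the
    one used to evaluate DRAGON.
    """
    top_k = ranked_topics[:k]           # keep only the k best predictions
    truth = set(true_topics)
    if not top_k or not truth:
        return 0.0
    hits = len(set(top_k) & truth)      # correctly predicted topics
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of the top 5 predictions match 3 true topics.
predicted = ["python", "machine-learning", "deep-learning", "pytorch", "nlp", "cli"]
actual = ["python", "pytorch", "computer-vision"]
score = f1_at_k(predicted, actual, k=5)
```

In this example precision is 2/5 and recall is 2/3, which gives an F1@5 of 0.5. A collection-level figure would typically be obtained by averaging such per-repository scores, which is again an assumption rather than something the abstract states.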
