GitHub is the world's largest platform for collaborative software development, with over 100 million users. It is also used extensively for open data collaboration, hosting more than 800 million open data files that total 142 terabytes. This study highlights the potential of open data on GitHub and demonstrates how it can accelerate AI research. We analyze the existing landscape of open data on GitHub and the patterns by which users share datasets. Our findings show that GitHub is one of the largest hosts of open data in the world and that its open data assets have grown at an accelerating rate over the past four years. By examining the open data landscape on GitHub, we aim to empower users and organizations to leverage existing open datasets and improve their discoverability, ultimately contributing to the ongoing AI revolution to help address complex societal issues. We release the three datasets collected to support this analysis as open datasets at https://github.com/github/open-data-on-github.