Data sharing and ontology use among agricultural genetics, genomics, and breeding databases and resources of the AgBioData Consortium

Jennifer L. Clarke, Laurel D. Cooper, Monica F. Poelchau, Tanya Z. Berardini, Justin Elser, Andrew D. Farmer, Stephen Ficklin, Sunita Kumari, Marie-Angélique Laporte, Rex T. Nelson, Rie Sadohara, Peter Selby, Anne E. Thessen, Brandon Whitehead, Taner Z. Sen

From arXiv, 17 pages, 8 figures

Over the last several decades, there has been rapid growth in the number and scope of agricultural genetics, genomics, and breeding (GGB) databases and resources. The AgBioData Consortium (https://www.agbiodata.org/) currently represents 44 databases and resources covering model or crop plant and animal GGB data, ontologies, pathways, genetic variation, and breeding platforms (referred to as 'databases' throughout). One of the goals of the Consortium is to facilitate FAIR (Findable, Accessible, Interoperable, and Reusable) data management and the integration of datasets, which requires data sharing along with structured vocabularies and/or ontologies. Two AgBioData working groups, focused on Data Sharing and Ontologies, conducted a survey to assess the status and future needs of the members in those areas. A total of 33 researchers responded to the survey, representing 37 databases. The results suggest that data sharing practices among AgBioData databases are in a healthy state, although it is not clear whether this holds for all metadata and data types across all databases, and that ontology use has not substantially changed since a similar survey was conducted in 2017. We recommend 1) providing training for database personnel in specific data sharing techniques, as well as in ontology use; 2) further studying what metadata is shared, and how well it is shared, among databases; 3) promoting an understanding of data sharing and ontologies in the stakeholder community; 4) improving data sharing and ontologies for specific phenotypic data types and formats; and 5) lowering specific barriers to data sharing and ontology use by identifying sustainability solutions and by identifying, promoting, or developing data standards. Combined, these improvements are likely to help AgBioData databases increase development efforts towards improved ontology use and data sharing via programmatic means.
