Data users need context and research expertise to search for and identify relevant datasets effectively. Leading data providers, such as the Inter-university Consortium for Political and Social Research (ICPSR), offer standardized metadata and search tools to support data search. Metadata standards, however, emphasize the machine-readability of data and its documentation, which leaves opportunities to enhance dataset search by improving users' ability to learn about, and make sense of, information about data. Prior research has shown that gaps in context and expertise are two main barriers users face in searching for, evaluating, and deciding whether to reuse data. In this paper, we propose a novel chatbot-based search system, DataChat, that leverages a graph database and a large language model to give users new ways to interact with and search for research data. DataChat complements data archives' and institutional repositories' ongoing efforts to curate, preserve, and share research data for reuse by making it easier for users to explore and learn about available research data.