Noting the urgent need for tools that support fast, user-friendly qualitative analysis of the large-scale textual corpora used in modern NLP, we propose turning to mature and well-tested methods from Information Retrieval (IR), a research field with a long history of tackling TB-scale document collections. We discuss how Pyserini, a widely used toolkit for reproducible IR research, can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts. We leverage the existing functionality of both platforms and propose novel features that further facilitate their integration. Our goal is to give NLP researchers tools that let them develop retrieval-based instrumentation for their data analytics needs with ease and agility. We include a Jupyter Notebook-based walkthrough of the core interoperability features, available on GitHub at https://github.com/huggingface/gaia. We then demonstrate how the ideas we present can be operationalized to create a powerful tool for qualitative data analysis in NLP. We present GAIA Search, a search engine built following the principles laid out above, which gives access to four popular large-scale text collections. GAIA serves a dual purpose: it illustrates the potential of the methodologies we discuss, and it is a standalone qualitative analysis tool that NLP researchers can use to understand datasets before training on them. GAIA is hosted live on Hugging Face Spaces at https://huggingface.co/spaces/spacerini/gaia.