ROOTS is a 1.6TB multilingual text corpus developed for the training of BLOOM, currently the largest language model explicitly accompanied by commensurate data governance efforts. In continuation of these efforts, we present the ROOTS Search Tool: a search engine over the entire ROOTS corpus offering both fuzzy and exact search capabilities. ROOTS is the largest corpus to date that can be investigated this way. The ROOTS Search Tool is open-sourced and available on Hugging Face Spaces. We describe our implementation and the possible use cases of our tool.