It is well established that large, diverse datasets play a pivotal role in the performance of modern AI systems for text and image modalities. However, there are no datasets for tabular data of comparable size and diversity to those available for text and images. Thus we present "TabLib", a compilation of 627 million tables totaling 69 TiB, along with 867 billion tokens of context. TabLib was extracted from numerous file formats, including CSV, HTML, SQLite, PDF, Excel, and others, sourced from GitHub and Common Crawl. The size and diversity of TabLib offer considerable promise in the table modality, reminiscent of the original promise of foundational datasets for text and images, such as The Pile and LAION.