The use of large language models for web information extraction is becoming increasingly fundamental to modern web information retrieval pipelines. However, existing datasets tend to be small, synthetic, or text-only, failing to capture the structural context of the web. We introduce ScrapeGraphAI-100k, a large-scale dataset of real-world LLM extraction events collected via opt-in ScrapeGraphAI telemetry during Q2 and Q3 of 2025. Starting from 9M events, we deduplicate and balance by schema to produce 93,695 examples spanning diverse domains and languages. Each instance includes Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata. We characterize the dataset's structural diversity and its failure modes as schema complexity increases. We also provide a fine-tuning experiment showing that a small language model (1.7B parameters) trained on a subset narrows the gap to larger baselines (30B parameters), underscoring the dataset's utility for efficient extraction. ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing, and is publicly available on HuggingFace.
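To make the per-instance structure concrete, the sketch below builds one hypothetical record with the fields the abstract names (Markdown content, prompt, JSON schema, LLM response) and runs a minimal validation check of the response against the schema. The field names and the record contents are illustrative assumptions, not the dataset's actual column names, and the validator covers only a small slice of JSON Schema (required keys and basic types).

```python
import json

# Hypothetical record mirroring the fields described in the abstract;
# names and values are assumptions for illustration only.
record = {
    "markdown": "# ACME Corp\nContact: info@acme.example",
    "prompt": "Extract the company name and contact email.",
    "schema": {
        "type": "object",
        "required": ["company", "email"],
        "properties": {
            "company": {"type": "string"},
            "email": {"type": "string"},
        },
    },
    "response": '{"company": "ACME Corp", "email": "info@acme.example"}',
}

# Map a few JSON Schema type names to Python types.
TYPES = {"object": dict, "string": str, "number": (int, float)}

def validates(response_text, schema):
    """Minimal check: the response parses as JSON, contains every
    required key, and each declared property has the expected type."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, TYPES[schema["type"]]):
        return False
    if any(key not in data for key in schema.get("required", [])):
        return False
    return all(
        isinstance(data[k], TYPES[prop["type"]])
        for k, prop in schema.get("properties", {}).items()
        if k in data
    )

print(validates(record["response"], record["schema"]))  # True
```

A full pipeline would use a complete JSON Schema validator rather than this toy check, but the shape of the loop is the same: parse the model output, then test it against the schema attached to the instance.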