The use of large language models for web information extraction is becoming increasingly fundamental to modern web information retrieval pipelines. However, existing datasets tend to be small, synthetic, or text-only, failing to capture the structural context of the web. We introduce ScrapeGraphAI-100k, a large-scale dataset of real-world LLM extraction events collected via opt-in ScrapeGraphAI telemetry during Q2 and Q3 of 2025. Starting from 9M events, we deduplicate and balance by schema to produce 93,695 examples spanning diverse domains and languages. Each instance includes Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata. We characterize the dataset's structural diversity and its failure modes as schema complexity increases. We also provide a fine-tuning experiment showing that a small language model (1.7B parameters) trained on a subset narrows the gap to larger baselines (30B parameters), underscoring the dataset's utility for efficient extraction. ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing, and is publicly available on HuggingFace.
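To make the per-instance structure concrete, the sketch below builds one hypothetical record with the fields the abstract names (Markdown content, prompt, JSON schema, LLM response) and runs a minimal validation check of the response against the schema. The field names and the record contents are illustrative assumptions, not the dataset's actual column names, and the validator covers only a small slice of JSON Schema (required keys and basic types).

```python
import json

# Hypothetical record mirroring the fields described in the abstract;
# names and values are assumptions for illustration only.
record = {
    "markdown": "# ACME Corp\nContact: info@acme.example",
    "prompt": "Extract the company name and contact email.",
    "schema": {
        "type": "object",
        "required": ["company", "email"],
        "properties": {
            "company": {"type": "string"},
            "email": {"type": "string"},
        },
    },
    "response": '{"company": "ACME Corp", "email": "info@acme.example"}',
}

# Map a few JSON Schema type names to Python types.
TYPES = {"object": dict, "string": str, "number": (int, float)}

def validates(response_text, schema):
    """Minimal check: the response parses as JSON, contains every
    required key, and each declared property has the expected type."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, TYPES[schema["type"]]):
        return False
    if any(key not in data for key in schema.get("required", [])):
        return False
    return all(
        isinstance(data[k], TYPES[prop["type"]])
        for k, prop in schema.get("properties", {}).items()
        if k in data
    )

print(validates(record["response"], record["schema"]))  # True
```

A full pipeline would use a complete JSON Schema validator rather than this toy check, but the shape of the loop is the same: parse the model output, then test it against the schema attached to the instance.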