ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation

Producing output that conforms to a specified JSON schema underlies tool use, structured extraction, and knowledge base construction in modern large language models. Despite this centrality, public datasets for the task remain small, synthetic, or text-only, and rarely pair real page content with the prompts and schemas used in practice. We introduce ScrapeGraphAI-100k, a corpus of 93,695 schema-constrained extraction events collected via opt-in ScrapeGraphAI telemetry in Q2-Q3 2025, deduplicated and balanced by schema from 9M raw events. The corpus spans 18,000+ unique schemas across 15 named languages plus a long-tail Other category, with English and Traditional Chinese covering 88% of detected content. Each instance pairs Markdown-converted page content with a prompt, a schema, an LLM response, and per-example structural conformance labels computed with jsonschema-rs (semantic correctness is out of scope, and raw HTML is deferred beyond v1.0). We characterize structural diversity across the corpus and identify sharp failure thresholds as schema complexity grows. As a case study, a 1.7B student fine-tuned on this data closely tracks the output distribution of its GPT-5-nano teacher, though it still trails a 30B-A3B reference (3.3B active parameters) on schema compliance. We offer this distillation result as preliminary evidence that grounding schema-constrained generation in real practitioner workloads at scale enables training and benchmarking that prior synthetic or text-only corpora could not support.
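To make the notion of a structural conformance label concrete, the following is a minimal sketch of how such a label could be computed for one response. It is not the authors' pipeline and does not use jsonschema-rs: it is a hand-rolled check covering only a small subset of JSON Schema (`type`, `required`, `properties`, `items`), and the function names `conforms` and `label_example` are hypothetical. It checks shape only, not whether the extracted values are correct.

```python
import json

# Map JSON Schema type names to Python types (small subset, for illustration).
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "null": type(None),
}


def _type_ok(instance, name):
    # bool is a subclass of int in Python, so handle it explicitly.
    if isinstance(instance, bool):
        return name == "boolean"
    # Unknown type names map to an empty tuple, which never matches.
    return isinstance(instance, _TYPES.get(name, ()))


def conforms(instance, schema):
    """Return True if `instance` structurally matches `schema`.

    Hypothetical helper: supports only `type`, `required`,
    `properties`, and `items`; all other keywords are ignored.
    """
    expected = schema.get("type")
    if expected is not None:
        names = [expected] if isinstance(expected, str) else expected
        if not any(_type_ok(instance, name) for name in names):
            return False
    if isinstance(instance, dict):
        # Every required key must be present.
        if any(key not in instance for key in schema.get("required", [])):
            return False
        # Present keys must match their sub-schema.
        for key, sub in schema.get("properties", {}).items():
            if key in instance and not conforms(instance[key], sub):
                return False
    if isinstance(instance, list) and "items" in schema:
        return all(conforms(item, schema["items"]) for item in instance)
    return True


def label_example(schema, response_text):
    """Parse a raw LLM response string and return a conformance label."""
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return False  # Unparseable output counts as non-conformant.
    return conforms(parsed, schema)
```

A response that parses as JSON but omits a required field, or gives a string where the schema asks for a number, is labeled non-conformant. A full validator such as jsonschema-rs additionally handles keywords like `enum`, `anyOf`, and `$ref`, which this sketch ignores.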

