Producing output that conforms to a specified JSON schema underlies tool use, structured extraction, and knowledge base construction in modern large language models. Despite this centrality, public datasets for the task remain small, synthetic, or text-only, and rarely pair real page content with the prompts and schemas used in practice. We introduce ScrapeGraphAI-100k, a corpus of 93,695 schema-constrained extraction events collected via opt-in ScrapeGraphAI telemetry in Q2–Q3 2025, deduplicated and balanced by schema from 9M raw events. The corpus spans 18,000+ unique schemas across 15 named languages plus a long-tail Other category, with English and Traditional Chinese covering 88% of detected content. Each instance pairs Markdown-converted page content with a prompt, a schema, an LLM response, and per-example structural conformance labels computed with jsonschema-rs (semantic correctness is out of scope, and raw HTML is deferred beyond v1.0). We characterize structural diversity across the corpus and identify sharp failure thresholds as schema complexity grows. As a case study, a 1.7B-parameter student model fine-tuned on this data closely tracks the output distribution of its GPT-5-nano teacher, though it still trails a 30B-A3B reference model (3.3B active parameters) on schema compliance. We offer this distillation result as preliminary evidence that grounding schema-constrained generation in real practitioner workloads at scale enables training and benchmarking that prior synthetic or text-only corpora could not support.
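To illustrate what a structural conformance label captures, and why it is distinct from semantic correctness, the following is a minimal sketch rather than the pipeline used to build the corpus. It is a simplified stand-in for jsonschema-rs that handles only a small subset of JSON Schema (`type` as a single string, `required`, `properties`, `items`). The function names `structural_errors` and `conformance_label` and the example schema are hypothetical.

```python
import json

# Map JSON Schema type names to Python types (small subset, for illustration).
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "null": type(None),
}


def structural_errors(instance, schema, path="$"):
    """Return structural violations for a small JSON Schema subset.

    Hypothetical helper: handles only `type` (single string), `required`,
    `properties`, and `items`. A real validator covers far more keywords.
    """
    errors = []
    expected = schema.get("type")
    if expected is not None:
        py_type = _TYPES[expected]
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(instance, bool) and expected in ("number", "integer")
        if not isinstance(instance, py_type) or bool_as_number:
            errors.append(f"{path}: expected {expected}")
            return errors
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required property '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(structural_errors(instance[key], subschema, f"{path}.{key}"))
    if isinstance(instance, list) and "items" in schema:
        for i, item in enumerate(instance):
            errors.extend(structural_errors(item, schema["items"], f"{path}[{i}]"))
    return errors


def conformance_label(response_text, schema):
    """Label a raw LLM response: True if it parses and has no structural errors."""
    try:
        instance = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return not structural_errors(instance, schema)


# Hypothetical extraction schema and responses.
schema = {
    "type": "object",
    "required": ["title", "price"],
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

good = '{"title": "Widget", "price": 9.5, "tags": ["a", "b"]}'
wrong_value = '{"title": "Not the real title", "price": 0}'  # well-formed, possibly wrong
bad = '{"title": "Widget", "price": "9.5"}'  # price has the wrong type
```

Under this sketch, `wrong_value` receives the same positive label as `good`, because the check looks only at shape and types and never at whether the extracted values match the page. That is the sense in which semantic correctness falls outside a structural label.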