DiagramBank: A Quality-Audited Dataset of Scientific Schematic Diagrams with Multi-Level Document Context

Scientific papers use schematic diagrams to communicate methods, workflows, and system structure, yet existing scientific-figure corpora often mix them with plots, screenshots, and photographs and rarely preserve document context. We introduce DiagramBank, a quality-audited dataset of 57,100 schematic diagrams curated from OpenReview-hosted AI/ML venues. Each record links a diagram image to its paper title, abstract, figure caption, in-text figure-reference spans, venue/year metadata, provenance fields, and filtering labels. DiagramBank is a reusable resource for scientific-document understanding, diagram retrieval, corpus analysis, and future benchmark construction. We describe its extraction and cascade-filtering pipeline, release schema, confidence-controlled views, dataset card, and indexing utilities. A manual blind audit of the released cascade-filtered records estimates 93.67% precision, and a separate CLIP threshold analysis characterizes the precision-coverage trade-off for simpler filtering views. We further provide lightweight metadata-indexing and authoring examples to illustrate downstream protocols without treating these utilities as standalone methods. The code is publicly available at https://github.com/csml-rpi/DiagramBank.
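
To make the record structure and the idea of a confidence-controlled view concrete, the following is a minimal Python sketch. It is not the released schema or tooling: the class `DiagramRecord`, all of its field names, the `clip_score` range, and the helpers `confidence_view` and `coverage` are hypothetical names chosen for illustration, based only on the fields the abstract lists.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiagramRecord:
    """Hypothetical record layout; field names are illustrative, not the released schema."""
    image_path: str
    paper_title: str
    abstract: str
    caption: str
    reference_spans: List[str]  # in-text passages that refer to the figure
    venue: str
    year: int
    clip_score: float           # assumed: CLIP-based "is a schematic" score in [0, 1]
    passed_cascade: bool        # assumed: label produced by the cascade filter


def confidence_view(records: List[DiagramRecord], min_clip_score: float) -> List[DiagramRecord]:
    """Keep only records whose (assumed) CLIP score meets the threshold."""
    return [r for r in records if r.clip_score >= min_clip_score]


def coverage(records: List[DiagramRecord], min_clip_score: float) -> float:
    """Fraction of records retained at the given threshold (0.0 for an empty input)."""
    if not records:
        return 0.0
    return len(confidence_view(records, min_clip_score)) / len(records)
```

Raising `min_clip_score` in a sketch like this shrinks the view, which is the coverage side of the trade-off the threshold analysis describes; measuring the precision side would still require human labels such as those from the manual audit.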

