Scientific papers use schematic diagrams to communicate methods, workflows, and system structure, yet existing scientific-figure corpora often mix them with plots, screenshots, and photographs and rarely preserve document context. We introduce DiagramBank, a quality-audited dataset of 57,100 schematic diagrams curated from OpenReview-hosted AI/ML venues. Each record links a diagram image to its paper title, abstract, figure caption, in-text figure-reference spans, venue/year metadata, provenance fields, and filtering labels. DiagramBank is a reusable resource for scientific-document understanding, diagram retrieval, corpus analysis, and future benchmark construction. We describe its extraction and cascade-filtering pipeline, release schema, confidence-controlled views, dataset card, and indexing utilities. A manual blind audit of the released cascade-filtered records estimates 93.67% precision, and a separate CLIP threshold analysis characterizes the precision--coverage trade-off for simpler filtering views. We further provide lightweight metadata-indexing and authoring examples to illustrate downstream protocols without treating these utilities as standalone methods. The code is public at: https://github.com/csml-rpi/DiagramBank.
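As a rough illustration of the record structure and the precision/coverage trade-off described in the abstract, the following minimal sketch models one record and a threshold-based view. All field names (`image_path`, `reference_spans`, `clip_score`, and so on), the toy records, and the assumption that a higher CLIP score means "more diagram-like" are hypothetical; they are not taken from the released schema, which is documented in the repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class DiagramRecord:
    # Hypothetical field names; the released schema may differ.
    image_path: str
    title: str
    abstract: str
    caption: str
    reference_spans: List[str] = field(default_factory=list)
    venue: Optional[str] = None
    year: Optional[int] = None
    clip_score: float = 0.0  # assumed: higher means more diagram-like


def confidence_view(records: List[DiagramRecord],
                    min_clip_score: float) -> List[DiagramRecord]:
    """Keep records whose (assumed) CLIP score meets the threshold."""
    return [r for r in records if r.clip_score >= min_clip_score]


def precision_and_coverage(records: List[DiagramRecord],
                           is_diagram: Dict[str, bool],
                           min_clip_score: float) -> Tuple[float, float]:
    """Precision and coverage of a threshold view against manual labels.

    `is_diagram` maps image_path -> bool, e.g. from a manual audit.
    Precision: fraction of kept records that are true diagrams.
    Coverage: fraction of all true diagrams that are kept.
    """
    kept = confidence_view(records, min_clip_score)
    true_total = sum(1 for r in records if is_diagram[r.image_path])
    true_kept = sum(1 for r in kept if is_diagram[r.image_path])
    precision = true_kept / len(kept) if kept else 0.0
    coverage = true_kept / true_total if true_total else 0.0
    return precision, coverage


# Toy records with made-up scores and labels, for illustration only.
toy = [
    DiagramRecord("a.png", "Paper A", "...", "Figure 1: Overview.", clip_score=0.9),
    DiagramRecord("b.png", "Paper B", "...", "Figure 2: Loss curves.", clip_score=0.7),
    DiagramRecord("c.png", "Paper C", "...", "Figure 1: Pipeline.", clip_score=0.4),
    DiagramRecord("d.png", "Paper D", "...", "Figure 3: Screenshot.", clip_score=0.2),
]
labels = {"a.png": True, "b.png": False, "c.png": True, "d.png": False}

strict = precision_and_coverage(toy, labels, 0.8)
loose = precision_and_coverage(toy, labels, 0.3)
```

On this toy data, raising the threshold keeps fewer records but a cleaner set, while lowering it recovers more true diagrams at the cost of admitting non-diagrams. That is the kind of trade-off a threshold analysis would characterize on real audit labels.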