Recent advances in computer vision (CV) and natural language processing have been driven by exploiting big data in practical applications. However, these research fields are still limited by the volume, versatility, and diversity of the available datasets. CV tasks such as image captioning, which has primarily been studied on natural images, still struggle to produce accurate and meaningful captions for the sketched images often included in scientific and technical documents. Progress on other tasks, such as 3D reconstruction from 2D images, requires larger datasets with multiple viewpoints. We introduce DeepPatent2, a large-scale dataset providing more than 2.7 million technical drawings with 132,890 object names and 22,394 viewpoints extracted from 14 years of US design patent documents. We demonstrate the usefulness of DeepPatent2 with conceptual captioning. We further discuss the potential of our dataset to facilitate other research areas such as 3D image reconstruction and image retrieval.