BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Juan Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte, François Savard, Ahmed Masry, Shravan Nayak, Rabiul Awal, Mahsa Massoud, Amirhossein Abaskohi, Zichao Li, Suyuchen Wang, Pierre-André Noël, Mats Leon Richter, Saverio Vadacchino, Shubham Agarwal, Sanket Biswas, Sara Shanian, Ying Zhang, Noah Bolger, Kurt MacDonald, Simon Fauvel, Sathwik Tejaswi, Srinivas Sunkara, Joao Monteiro, Krishnamurthy DJ Dvijotham, Torsten Scholak, Nicolas Chapados, Sepideh Kharagani, Sean Hughes, M. Özsu, Siva Reddy, Marco Pedersoli, Yoshua Bengio, Christopher Pal, Issam Laradji, Spandana Gella, Perouz Taslakian, David Vazquez, Sai Rajeswar

Source: arXiv. Project page: https://bigdocs.github.io

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long, structured outputs can also benefit from multimodality. Despite this, the use of multimodal models in commercial applications is often constrained by scarce training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks, for which we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance by up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over those from GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io.
