BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks

Juan Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte, François Savard, Ahmed Masry, Shravan Nayak, Rabiul Awal, Mahsa Massoud, Amirhossein Abaskohi, Zichao Li, Suyuchen Wang, Pierre-André Noël, Mats Leon Richter, Saverio Vadacchino, Shubham Agarwal, Sanket Biswas, Sara Shanian, Ying Zhang, Noah Bolger, Kurt MacDonald, Simon Fauvel, Sathwik Tejaswi, Srinivas Sunkara, Joao Monteiro, Krishnamurthy DJ Dvijotham, Torsten Scholak, Nicolas Chapados, Sepideh Kharagani, Sean Hughes, M. Özsu, Siva Reddy, Marco Pedersoli, Yoshua Bengio, Christopher Pal, Issam Laradji, Spandana Gella, Perouz Taslakian, David Vazquez, Sai Rajeswar

Source: arXiv. The project is hosted at https://bigdocs.github.io

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long, structured outputs can also be enhanced by multimodality. Despite this potential, the use of multimodal models in commercial applications is often constrained by scarce training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure that our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite of 10 novel tasks with datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUIs) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance by up to 25.8% over closed-source GPT-4o on document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over those from GPT-4o. This suggests that BigDocs can help both academics and the open-source community use and improve AI tools to enhance multimodal capabilities and document reasoning.
