Recent Large Multimodal Models (LMMs) have leveraged extensive multimodal datasets to improve their performance on complex knowledge-driven tasks. However, persistent perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. To address these issues, we introduce PIN (Paired and INterleaved multimodal documents), a novel dataset format designed to improve both the depth and breadth of multimodal training. The PIN format is built on three foundational principles: knowledge intensity, scalability, and support for diverse training modalities. It combines markdown files with comprehensive images, enriching training data with a dense knowledge structure and enabling versatile training strategies. We present PIN-14M, an open-source dataset comprising 14 million samples derived from a diverse range of Chinese and English sources, tailored to include complex web and scientific content. The dataset is constructed with attention to data quality and ethical integrity, and aims to facilitate advanced training strategies and improve model robustness against common multimodal training pitfalls. Our initial results, which form the basis of this technical report, suggest significant potential for the PIN format in refining LMM performance; we plan future expansions and detailed evaluations of its impact on model capabilities.