Understanding and interacting with text-heavy visual content that spans multiple images is a major challenge for traditional vision models. This paper focuses on enhancing vision models' ability to comprehend and learn from images containing large amounts of textual information, such as pages from textbooks and research papers that include tables and multiple figures (for example, graphs with different types of axes and scales). The approach involves dataset preprocessing, fine-tuning on instruction-oriented data, and evaluation. We also built a visual chat application that integrates CLIP for image encoding with a model from the Massive Text Embedding Benchmark and is designed to handle both textual and visual inputs. An accuracy of 96.71% was obtained. The project aims to improve advanced vision models' ability to understand complex, interconnected visual and textual data, contributing to multimodal AI.