In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, this knowledge is hidden not only in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but lack image-captioning ability; multimodal LLMs (MLLMs) can understand images but lack sufficient domain knowledge. To address these issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) for ICT, and constructs a standardized evaluation system to validate DICModel's performance. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool with LLMs, which are used for the first-stage supervised fine-tuning (SFT) of DICModel. Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT. Finally, experts and LLMs jointly synthesize about 1.5K visual question answering examples for instruction-based SFT. Experimental results indicate that our DICModel, with only 7B parameters, outperforms state-of-the-art (SOTA) models with 32B parameters. Compared to SOTA models with 7B and 32B parameters, DICModel improves the BLEU metric by approximately 56.8% and 20.8%, respectively. On objective questions constructed by ICT-domain experts, DICModel outperforms Qwen2.5-VL 32B by 1% in accuracy. In summary, this work can efficiently and accurately extract the logical text from images, which is expected to advance multimodal models in the ICT domain.
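To make the first-stage data synthesis concrete, the following is a minimal sketch (not the authors' released pipeline) of how Mermaid and an LLM could be combined to produce one image-text training pair. It assumes mermaid-cli (`mmdc`) is installed for rendering, and `llm_generate` is a hypothetical stand-in for an actual chat-completion client.

```python
# Sketch of Mermaid-based image-text pair synthesis (stage-1 SFT data).
# Assumptions: mermaid-cli ("mmdc") is on PATH; `llm_generate` is a
# hypothetical placeholder for a real LLM API call.
import subprocess
from pathlib import Path


def llm_generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    raise NotImplementedError


def synthesize_pair(topic: str, out_dir: Path, idx: int) -> tuple[Path, str]:
    """Create one (image, caption) training pair for an ICT topic."""
    # 1) Ask the LLM for Mermaid source describing an ICT-domain diagram.
    mermaid_src = llm_generate(
        f"Write Mermaid flowchart code for an ICT-domain diagram about: {topic}"
    )
    # 2) Render the Mermaid source to a PNG with mermaid-cli.
    mmd_file = out_dir / f"{idx}.mmd"
    png_file = out_dir / f"{idx}.png"
    mmd_file.write_text(mermaid_src, encoding="utf-8")
    subprocess.run(["mmdc", "-i", str(mmd_file), "-o", str(png_file)], check=True)
    # 3) Ask the LLM to describe the diagram's logical content; this text
    #    serves as the caption label paired with the rendered image.
    caption = llm_generate(
        f"Describe the logical content of this Mermaid diagram:\n{mermaid_src}"
    )
    return png_file, caption
```

Because the caption is generated from the same Mermaid source that produced the image, the image and its text label stay logically consistent by construction, which is what makes this kind of synthetic pair usable for captioning SFT.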