Recent advances in large vision-language models (VLMs) have changed how information extraction systems are built. VLMs have set new benchmarks in document understanding and in building question-answering systems across various industries. They have become significantly better at generating text from document images and at providing accurate answers to questions. However, challenges remain in using these models effectively to build a precise conversational system. General prompting techniques used with large language models are often unsuitable for these specially designed vision-language models. The output produced by such generic prompts tends to be unspecific and may contain information gaps when compared with the actual content of the document. To obtain more accurate and specific answers, the vision-language model requires a well-targeted prompt along with the document image. This paper discusses a technique called target prompting, which explicitly targets parts of a document image and generates answers from those specific regions only. The paper also evaluates the responses produced by each prompting technique using different user queries and input prompts.