ProcTag: Process Tagging for Assessing the Efficacy of Document Instruction Data

Recently, large language models (LLMs) and multimodal large language models (MLLMs) have demonstrated promising results on document visual question answering (VQA) tasks, particularly after training on document instruction datasets. An effective evaluation method for document instruction data is crucial for constructing instruction data with high efficacy, which in turn facilitates the training of LLMs and MLLMs for document VQA. However, most existing evaluation methods for instruction data are limited to the textual content of the instructions themselves, which hinders the effective assessment of document instruction datasets and constrains their construction. In this paper, we propose ProcTag, a data-oriented method that assesses the efficacy of document instruction data. ProcTag tags the execution process of instructions rather than the instruction text itself. By using the diversity and complexity of these tags to assess the efficacy of a given dataset, ProcTag enables selective sampling or filtering of document instructions. Furthermore, we propose DocLayPrompt, a novel semi-structured, layout-aware document prompting strategy for effectively representing documents. Experiments demonstrate that sampling existing open-sourced and generated document VQA/instruction datasets with ProcTag significantly outperforms current methods for evaluating instruction data. Notably, with ProcTag-based sampling on the generated document datasets, only 30.5% of the document instructions are required to achieve 100% of the efficacy of the complete dataset. The code is publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/ProcTag.
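The abstract does not spell out how tag diversity and complexity drive sampling, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each instruction already carries a set of process tags, uses the tag count as a stand-in for complexity, and keeps an instruction only if it contributes a tag not yet covered. The function name `sample_by_tags`, the tag names, and the `budget` parameter are all hypothetical.

```python
from typing import Dict, List, Set


def sample_by_tags(tagged: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedy selection: visit complex instructions first, keep those adding unseen tags."""
    # Complexity proxy (assumption): the number of process tags on an instruction.
    # Ties are broken by instruction id so the result is deterministic.
    ordered = sorted(tagged, key=lambda k: (-len(tagged[k]), k))
    covered: Set[str] = set()
    selected: List[str] = []
    for key in ordered:
        if len(selected) >= budget:
            break
        # Diversity criterion (assumption): keep only instructions with a new tag.
        if tagged[key] - covered:
            selected.append(key)
            covered |= tagged[key]
    return selected


# Hypothetical instruction ids and process tags, for illustration only.
examples = {
    "q1": {"locate_table", "read_cell"},
    "q2": {"locate_table", "read_cell", "compare_values"},
    "q3": {"read_title"},
    "q4": {"locate_table"},
}
picked = sample_by_tags(examples, budget=2)
print(picked)  # ['q2', 'q3']
```

Here `q1` and `q4` are skipped because `q2` already covers all of their tags, which shows how a subset can preserve tag coverage while dropping redundant instructions.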

