PDFs are the second-most used document type on the internet (after HTML). Yet existing QA datasets are commonly built from plain-text sources or address only specific domains. In this paper, we present pdfQA, a multi-domain dataset of 2K human-annotated (real-pdfQA) and 2K synthetic (syn-pdfQA) QA pairs, differentiated along ten complexity dimensions (e.g., file type, source modality, source position, answer type). We apply and evaluate quality and difficulty filters on both datasets, obtaining valid and challenging QA pairs. We answer the questions with open-source LLMs, revealing persistent challenges that correlate with our complexity dimensions. pdfQA thus provides a basis for end-to-end QA pipeline evaluation, testing diverse skill sets and local optimizations (e.g., in information retrieval or parsing).
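To make the dataset structure concrete, the sketch below shows one plausible way a pdfQA item and the quality/difficulty filtering step could be represented. All field names, the `PdfQAItem` class, and the `passes_filters` helper are illustrative assumptions, not the paper's actual schema or filtering criteria.

```python
from dataclasses import dataclass

# Hypothetical record layout for a single pdfQA item; field names are
# illustrative assumptions, not the dataset's actual schema.
@dataclass
class PdfQAItem:
    question: str
    answer: str
    pdf_path: str          # source PDF file
    file_type: str         # complexity dimension, e.g. "scanned" or "born-digital"
    source_modality: str   # e.g. "text", "table", "figure"
    source_position: str   # e.g. "body", "footnote", "caption"
    answer_type: str       # e.g. "extractive", "abstractive", "numeric"
    subset: str            # "real-pdfQA" (human-annotated) or "syn-pdfQA" (synthetic)

def passes_filters(item: PdfQAItem, baseline_correct: bool) -> bool:
    """Toy combination of a quality and a difficulty filter:
    keep items that are well-formed and that a baseline model fails on."""
    quality_ok = bool(item.question.strip()) and bool(item.answer.strip())
    difficult = not baseline_correct  # difficulty proxy: baseline answers incorrectly
    return quality_ok and difficult
```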