Advances in AI and natural language processing have transformed human-machine language interaction, with question answering (QA) systems playing a pivotal role. The knowledge base question answering (KBQA) task, which draws on structured knowledge graphs (KGs), makes it possible to handle a broad range of knowledge-intensive questions. However, a significant gap exists in KBQA datasets, especially for low-resource languages. Many existing construction pipelines for these datasets are outdated and labor-intensive, and modern assistive tools such as large language models (LLMs) are not used to reduce the workload. To address this, we have designed and implemented a modern, semi-automated approach for creating datasets for KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR), tailored explicitly to low-resource environments. We executed this pipeline and introduced the PUGG dataset, the first Polish KBQA dataset, along with novel datasets for MRC and IR. Additionally, we provide a comprehensive implementation, insightful findings, detailed statistics, and an evaluation of baseline models.