People often seek information on a topic through conversation, asking a series of questions and receiving answers. To support research on conversational question answering, we introduce PCoQA, the first \textbf{P}ersian \textbf{Co}nversational \textbf{Q}uestion \textbf{A}nswering dataset, a resource of information-seeking dialogs containing a total of 9,026 contextually driven questions. Each dialog involves a questioner, a responder, and a Wikipedia document: the questioner asks several interconnected questions about the text, and the responder answers each question with a span of the document. PCoQA is designed to pose new challenges relative to previous question answering datasets, including more open-ended, non-factual answers, longer answers, and less lexical overlap. In addition to the dataset, this paper reports the performance of several benchmark models, including baseline models and pre-trained models, the latter used to improve performance. The dataset and benchmarks are available on our GitHub page.
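To make the dialog structure concrete, the following is a minimal Python sketch of how one such dialog could be represented: a document, a sequence of questions, and each answer stored as a character span of the document. The class names, field names, the `make_turn` helper, and the `lexical_overlap` measure are all hypothetical illustrations, not the released PCoQA schema or the paper's metric, and the English example text is a placeholder rather than data from the dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    """One question with its answer stored as a span of the document (hypothetical schema)."""
    question: str
    answer_start: int  # character offset into the document, inclusive
    answer_end: int    # character offset into the document, exclusive


@dataclass
class Dialog:
    """A document plus an ordered list of question/answer turns (hypothetical schema)."""
    document: str
    turns: List[Turn] = field(default_factory=list)

    def answer_text(self, index: int) -> str:
        # The answer is never free text: it is sliced out of the document.
        turn = self.turns[index]
        return self.document[turn.answer_start:turn.answer_end]

    def history(self, index: int) -> List[str]:
        # Earlier questions and answers, which a model needs in order to
        # interpret a question that depends on the preceding context.
        items: List[str] = []
        for i in range(index):
            items.append(self.turns[i].question)
            items.append(self.answer_text(i))
        return items


def make_turn(document: str, question: str, answer: str) -> Turn:
    """Build a Turn by locating the first occurrence of the answer in the document."""
    start = document.find(answer)
    if start < 0:
        raise ValueError("answer must be a span of the document")
    return Turn(question, start, start + len(answer))


def lexical_overlap(question: str, answer: str) -> float:
    """Fraction of question tokens that also occur in the answer.

    A crude whitespace-token proxy chosen for illustration only; it is an
    assumption, not the overlap measure used in the paper.
    """
    q_tokens = set(question.lower().split())
    a_tokens = set(answer.lower().split())
    if not q_tokens:
        return 0.0
    return len(q_tokens & a_tokens) / len(q_tokens)


# Placeholder example (not taken from the dataset).
document = "Tehran is the capital of Iran. It lies at the foot of the Alborz mountains."
dialog = Dialog(document)
dialog.turns.append(make_turn(document, "What is the capital of Iran?", "Tehran is the capital of Iran"))
dialog.turns.append(make_turn(document, "Where does it lie?", "at the foot of the Alborz mountains"))
```

In this sketch the second question ("Where does it lie?") cannot be answered without the first turn, which is why `history` returns the preceding questions and answers for a model to consume alongside the document.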