DragonVerseQA: Open-Domain Long-Form Context-Aware Question-Answering

This paper proposes a novel approach to building DragonVerseQA, an open-domain, long-form Over-The-Top (OTT) Question-Answering (QA) dataset oriented to the fantasy universe of the "House of the Dragon" and "Game of Thrones" TV series. Most existing QA datasets focus on short, fact-based answers sourced almost solely from Wikipedia articles, and so lack the depth and contextual richness needed for sophisticated narrative understanding. We curate a single dataset that combines full episode summaries from HBO and fandom wiki websites, user reviews from sources such as IMDb and Rotten Tomatoes, other high-quality, open-domain, legally admissible sources, and structured data from repositories such as WikiData. Drawing on these varied sources, the dataset provides multi-dimensional context that reflects complex character dynamics and plot developments. Heavy data preprocessing and filtering ensure that only meaningful, non-spam, unbiased reviews are included in the enriched dataset. Long-form answers generated from this enriched context give comprehensive insights, which makes the dataset valuable for improving conversational AI, narrative analysis, sentiment analysis, summarization techniques, and relation extraction. A comparative analysis with state-of-the-art QA datasets such as SQuAD 2.0, TriviaQA, and Natural Questions highlights the advantages of our dataset in contextual complexity and answer length. Detailed reviews add layers of audience sentiment and narrative interpretation, setting a new quality benchmark for domain-specific QA. Our work also enables a deeper understanding of entertainment-industry content and opens the door to more knowledgeable and creative AI-driven interactions within digital media environments.
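The abstract does not spell out the preprocessing steps, so the following is only a minimal sketch of how one multi-source context record might be assembled and how reviews might be filtered. The `EpisodeContext` fields, the `filter_reviews` heuristics (minimum word count, URL check, case-insensitive de-duplication), and the sample strings are all illustrative assumptions, not the authors' pipeline.

```python
import re
from dataclasses import dataclass, field

# Hypothetical spam heuristic: treat any review containing a link as spam.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


@dataclass
class EpisodeContext:
    """One enriched context record (field names are assumptions)."""
    episode_id: str
    summary: str                                   # episode summary text
    reviews: list[str] = field(default_factory=list)   # filtered user reviews
    facts: list[tuple[str, str, str]] = field(default_factory=list)  # (subject, relation, object)


def filter_reviews(reviews, min_words=5):
    """Keep reviews that are long enough, link-free, and not duplicates."""
    seen = set()
    kept = []
    for review in reviews:
        text = " ".join(review.split())            # collapse runs of whitespace
        if len(text.split()) < min_words:          # too short to be informative
            continue
        if URL_PATTERN.search(text):               # likely spam or promotion
            continue
        key = text.lower()
        if key in seen:                            # case-insensitive duplicate
            continue
        seen.add(key)
        kept.append(text)
    return kept


def build_context(episode_id, summary, raw_reviews, facts):
    """Merge summary, filtered reviews, and structured facts into one record."""
    return EpisodeContext(
        episode_id=episode_id,
        summary=summary.strip(),
        reviews=filter_reviews(raw_reviews),
        facts=list(facts),
    )


if __name__ == "__main__":
    raw = [
        "Great!",
        "The pacing of this episode felt deliberate and the ending landed well.",
        "the pacing of this episode felt  deliberate and the ending landed well.",
        "Watch every episode free at http://spam.example right now today",
    ]
    ctx = build_context("example-episode", " Placeholder summary. ", raw,
                        [("subject", "relation", "object")])
    print(len(ctx.reviews))
```

A real pipeline would need far stronger spam and bias detection than these three checks; the sketch only shows where such filters sit relative to the merged record.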



