This paper proposes a novel approach to developing an open-domain, long-form Over-The-Top (OTT) Question-Answering (QA) dataset, DragonVerseQA, oriented specifically to the fantasy universe of the "House of the Dragon" and "Game of Thrones" TV series. Most existing QA datasets focus on short, fact-based answers sourced almost solely from Wikipedia articles and lack the depth and contextual richness needed for sophisticated narrative understanding. We curate a single dataset that combines full episode summaries from HBO and fandom wiki websites, user reviews from sources such as IMDb and Rotten Tomatoes, other high-quality, open-domain, legally admissible sources, and structured data from repositories such as WikiData. Drawing on these varied sources, the dataset provides a multi-dimensional context that reflects complex character dynamics and plot developments. Reviews enter the enriched dataset only after heavy preprocessing and filtering, so that the retained reviews are meaningful, non-spam, and unbiased. Long-form answers generated from this enriched context provide comprehensive insights, which makes the dataset valuable for improving conversational AI, narrative analysis, sentiment analysis, summarization techniques, and relation extraction. A comparative analysis with state-of-the-art QA datasets such as SQuAD 2.0, TriviaQA, and Natural Questions highlights the advantages of our dataset in contextual complexity and answer length. Detailed reviews add layers of audience sentiment and narrative interpretation, setting a new quality benchmark for domain-specific QA. Our work also enables a deeper understanding of entertainment-industry content and opens the door to more knowledgeable and creative AI-driven interactions within digital media environments.