Characters are at the heart of every story, driving the plot and engaging readers. In this study, we explore character understanding in full-length books, which contain complex narratives and numerous interacting characters. We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation covering character development, personality, and social context. We introduce the BookWorm dataset, which pairs books from Project Gutenberg with human-written descriptions and analyses. Using this dataset, we evaluate state-of-the-art long-context models in zero-shot and fine-tuning settings, using both retrieval-based and hierarchical processing to handle book-length inputs. Our findings show that retrieval-based approaches outperform hierarchical ones on both tasks. Additionally, fine-tuned models using coreference-based retrieval produce the most factual descriptions, as measured by fact- and entailment-based metrics. We hope our dataset, experiments, and analysis will inspire further research in character-based narrative understanding.