Recent approaches to automatically detecting the speaker of an utterance of direct speech often disregard general information about characters in favor of local information found in the context, such as surrounding mentions of entities. In this work, we explore stylistic representations of characters built by encoding their quotes with off-the-shelf pretrained Authorship Verification models on a large corpus of English novels (the Project Dialogism Novel Corpus). Results suggest that the combination of stylistic and topical information captured by some of these models accurately distinguishes characters from one another, but does not necessarily improve over semantic-only models when attributing quotes. However, these results vary across novels, and further investigation of stylometric models specifically tailored to literary texts and the study of characters is needed.