Language models that are sensitive to external context can more effectively capture the speaking patterns of individuals with specific characteristics or in particular environments. However, obtaining and leveraging annotations that describe such context can be challenging. In this work, we show how to leverage rich character and film annotations to personalise language models in a scalable manner. Our best model reduces perplexity by up to 6.5% compared to a parameter-matched language model. Our approach performs on par with speaker-specific fine-tuning when fine-tuning data (i.e. past dialogue) for individual speakers is available. It also generalises well to a scenario with no such data, relying instead on combinations of demographic characteristics expressed via metadata. Our findings are consistent across two corpora, one of which is also a contribution of this paper: Cornell-rich contains rich manual annotations for 863 speaking characters from the Cornell Movie Dialog Corpus, including features such as characteristic quotes and character descriptions, along with six automatically extracted metadata features for over 95% of the featured films. Finally, we present a cost-benefit analysis highlighting which annotations are most cost-effective in reducing perplexity.