As language models become capable of processing increasingly long and complex texts, there has been growing interest in their application within computational literary studies. However, evaluating the usefulness of these models for such tasks remains challenging due to the cost of fine-grained annotation for long-form texts and the data-contamination concerns inherent in using public-domain literature. Existing embedding-similarity datasets are not suitable for evaluating literary-domain tasks because they focus on coarse-grained similarity and primarily on very short texts. We assemble and release FICSIM, a dataset of long-form, recently written fiction that includes scores along 12 axes of similarity informed by author-produced metadata and validated by digital humanities scholars. We evaluate a suite of embedding models on this task, demonstrating a tendency across models to prioritize surface-level features over the semantic categories that would be useful for computational literary studies. Throughout our data-collection process, we prioritize author agency and rely on continual, informed author consent.