In this study, we evaluate deep learning methods for sentence-level semantic classification as a way to accelerate corpus building in the humanities and linguistics, a traditionally time-consuming task. We introduce a novel corpus of around 2500 sentences, spanning 300 BCE to 900 CE, covering sexual semantics (medical, erotica, etc.). We evaluate various sentence classification approaches and different input embedding layers, and show that all of them consistently outperform simple token-based searches. We explore the integration of idiolectal and sociolectal metadata embeddings (century, author, type of writing), but find that it leads to overfitting. Our results demonstrate the effectiveness of this approach: using HAN, we reach a precision of 70.60% and a true positive rate (TPR) of 86.33%. We also evaluate the impact of dataset size on model performance (420 instead of 2013), and show that, while our models perform worse, they still offer usable precision and TPR, 69% and 51% respectively, even without MLM. Given these results, we analyze the attention mechanism as a supporting tool that can help humanists produce more data.
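The two metrics reported above can be made concrete with a minimal sketch. It assumes binary labels (1 for a sentence carrying the target semantics, 0 otherwise); the function name `precision_and_tpr` and the toy labels are hypothetical and are not taken from the corpus or from the models evaluated here.

```python
from typing import Sequence, Tuple


def precision_and_tpr(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (precision, TPR) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Precision: share of predicted positives that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # TPR (recall): share of actual positives that were retrieved.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, tpr


# Toy labels, made up purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
precision, tpr = precision_and_tpr(y_true, y_pred)
print(f"precision={precision:.2%}, TPR={tpr:.2%}")
```

For corpus building, TPR indicates how many relevant sentences a classifier retrieves, while precision indicates how much of the retrieved material a humanist would have to discard by hand.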