This paper describes how we train BERT models to transfer a coding system, developed on the paragraphs of one Hungarian literary journal, to another journal. The coding system is designed to track trends in the perception of literary translation around the 1989 political transformation in Hungary. We use 10-fold cross-validation to evaluate not only task performance but also the consistency of the annotation, and to obtain better predictions from an ensemble. Extensive hyperparameter tuning ensures the best possible results and fair comparisons. To handle label imbalance, we use loss functions and evaluation metrics robust to it. We evaluate the effect of domain shift on a test set sampled from the target domain, establishing the sample size by estimating bootstrapped confidence intervals via simulation. In this way, we show that our models can carry the annotation system over to the target domain. Comparisons yield further insights: learning multilabel correlations and applying a confidence penalty improve resistance to domain shift, and domain adaptation on OCR-ed text from another domain improves performance almost as much as adaptation on the corpus under study. Our code is available at https://codeberg.org/zsamboki/bert-annotator-ensemble.
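The sample-size estimation mentioned above can be illustrated with a minimal simulation: draw hypothetical per-example 0/1 correctness scores at an assumed accuracy, compute a percentile-bootstrap confidence interval for the mean, and pick the smallest test-set size whose interval is tight enough. This is only a sketch under assumed values (accuracy, CI half-width target, candidate sizes); it is not the paper's actual procedure or metric.

```python
import random
import statistics


def bootstrap_ci_halfwidth(scores, n_boot=1000, alpha=0.05, rng=None):
    """Half-width of the percentile-bootstrap CI for the mean of `scores`."""
    rng = rng or random.Random(0)
    n = len(scores)
    # Resample with replacement n_boot times and sort the resampled means.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return (hi - lo) / 2


def smallest_sample_size(p_correct, target_halfwidth, candidate_sizes, rng=None):
    """Simulate 0/1 scores at an assumed accuracy `p_correct` and return the
    smallest candidate test-set size whose bootstrap CI is tight enough."""
    rng = rng or random.Random(42)
    for n in candidate_sizes:
        scores = [1.0 if rng.random() < p_correct else 0.0 for _ in range(n)]
        if bootstrap_ci_halfwidth(scores, rng=rng) <= target_halfwidth:
            return n
    return None
```

For example, assuming roughly 80% accuracy and requiring a 95% CI half-width of at most 0.05, `smallest_sample_size(0.8, 0.05, [100, 200, 400, 800])` selects a size of a few hundred paragraphs; larger samples shrink the interval further.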