In this project, we investigated the use of advanced machine learning methods, specifically fine-tuned large language models, for pre-annotating data for a lexical extension task: adding descriptive words (verbs) to an existing but still incomplete ontology of event types. We focused on several research questions, ranging from a possible heuristic that gives annotators at least hints about which verbs to include and which fall outside the current version of the ontology, to the possible use of the automatic scores to help annotators more efficiently find a threshold for identifying verbs that cannot be assigned to any existing class and are therefore to be used as seeds for a new class. We also carefully examined the correlation of the automatic scores with the human annotation. While the correlation turned out to be strong, its influence on the annotation proper is modest due to its near linearity, even though the mere fact of such pre-annotation leads to relatively short annotation times.