This paper addresses the problem of selecting a set of texts for annotation in text classification using retrieval methods when the number of annotations is limited by constraints on human resources. An additional challenge is handling binary categories with a small number of positive instances, which reflects severe class imbalance. In our setting, where annotation takes place over a long period, the texts to be annotated can be selected in batches, with previous annotations guiding the choice of the next batch. To address these challenges, we propose leveraging SHAP to construct a high-quality set of queries for Elasticsearch and semantic search, with the aim of identifying sets of texts for annotation that best help with class imbalance. The approach is tested on sets of cue texts describing possible future events, constructed by participants in studies aimed at helping with the management of obesity and diabetes. We introduce an effective method for selecting a small set of texts for annotation and building high-quality classifiers, integrating vector search, semantic search, and machine learning classifiers. Our experiments demonstrate improved F1 scores for the minority classes in binary classification.