This paper addresses the problem of selecting a set of texts for annotation in text classification using retrieval methods when the number of annotations is limited by constraints on human resources. An additional challenge is handling binary categories with a small number of positive instances, that is, severe class imbalance. In our setting, annotation takes place over a long period, so the texts to be annotated can be selected in batches, with earlier annotations guiding the choice of the next batch. To address these challenges, we propose leveraging SHAP to construct a high-quality set of queries for Elasticsearch and semantic search, with the aim of identifying sets of texts for annotation that help mitigate class imbalance. The approach is tested on sets of cue texts describing possible future events, constructed by participants in studies aimed at helping with the management of obesity and diabetes. We introduce an effective method for selecting a small set of texts for annotation and building high-quality classifiers, integrating vector search, semantic search, and machine learning classifiers. Our experiments demonstrate improved F1 scores for the minority classes in binary classification.
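As a rough illustration of how SHAP output might be turned into a retrieval query, the following minimal sketch builds an Elasticsearch `bool`/`should` query body from per-term attribution scores. It is not the method of the paper: the function name `build_es_query`, the field name `text`, the normalization of boosts, and the example scores are all assumptions made for illustration, and the attribution scores are assumed to have been computed elsewhere (for instance as mean SHAP values toward the minority class).

```python
from typing import Dict


def build_es_query(term_scores: Dict[str, float], top_k: int = 5,
                   field: str = "text", size: int = 20) -> dict:
    """Build an Elasticsearch bool/should query body from term attributions.

    term_scores is ASSUMED to map a vocabulary term to an attribution score
    (e.g. a mean SHAP value) toward the positive, minority class.
    """
    # Keep only terms that push predictions toward the positive class.
    positive = [(t, s) for t, s in term_scores.items() if s > 0]
    # Strongest attributions first; ties broken alphabetically for determinism.
    positive.sort(key=lambda ts: (-ts[1], ts[0]))
    top = positive[:top_k]
    if not top:
        raise ValueError("no term with a positive attribution score")

    # Scale boosts so the strongest term gets 1.0 (an illustrative choice).
    max_score = top[0][1]
    should = [
        {"match": {field: {"query": term, "boost": round(score / max_score, 3)}}}
        for term, score in top
    ]
    return {
        "size": size,  # number of candidate texts to retrieve for the batch
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
    }


# Hypothetical attribution scores, for illustration only.
example_scores = {"craving": 0.42, "party": 0.21, "walk": -0.30, "dessert": 0.35}
query_body = build_es_query(example_scores, top_k=2)
```

The resulting dictionary could be sent to the search endpoint of an index holding the unlabeled texts, and the returned hits would form the next candidate batch for annotation. After each annotated batch, the classifier and the attribution scores would be recomputed and the query rebuilt.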