Reliable machine-learning models in biomedical settings depend on accurate labels, yet annotating biomedical time-series data remains challenging. Algorithmic sample selection may support annotation, but evidence from studies involving real human annotators is scarce. We therefore compared three sample selection methods for annotation: random sampling (RND), farthest-first traversal (FAFT), and a graphical user interface-based method (2DV) that enables exploration of complementary 2D visualizations of high-dimensional data. We evaluated the methods across four classification tasks in infant motility assessment (IMA) and speech emotion recognition (SER). Twelve annotators, categorized as experts or non-experts, annotated data under a limited annotation budget, and post-annotation experiments were conducted to evaluate the sampling methods. Across all classification tasks, 2DV performed best when labels were aggregated across annotators. In IMA, 2DV captured rare classes most effectively, but under the limited annotation budget it also exhibited greater annotator-to-annotator variability in label distribution, which decreased classification performance when models were trained on individual annotators' labels; in these cases, FAFT excelled. For SER in the individual-annotator setting, 2DV outperformed the other methods among expert annotators and performed on par with them among non-experts. A failure risk analysis revealed that RND was the safest choice when annotator count or annotator expertise was uncertain, whereas 2DV carried the highest risk because of its greater label distribution variability. Furthermore, post-experiment interviews indicated that 2DV made the annotation task more interesting and enjoyable. Overall, 2DV-based sampling appears promising for biomedical time-series data annotation, particularly when the annotation budget is not highly constrained.
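For readers unfamiliar with the two algorithmic baselines, the following is a minimal sketch of random sampling and farthest-first traversal over a set of feature vectors. It is illustrative only and not the implementation used in the study: the abstract does not specify the feature representation, the distance metric, or how the first sample is chosen, so Euclidean distance and a random (or caller-supplied) starting point are assumptions here, and the names `random_sampling`, `farthest_first_traversal`, `budget`, and `start` are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence

Point = Sequence[float]


def random_sampling(n_points: int, budget: int, seed: int = 0) -> List[int]:
    """RND baseline: draw `budget` distinct indices uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(n_points), min(budget, n_points))


def farthest_first_traversal(
    points: Sequence[Point],
    budget: int,
    start: Optional[int] = None,
    seed: int = 0,
) -> List[int]:
    """Greedy farthest-first traversal (Euclidean distance is an assumption).

    Each step selects the point whose distance to its nearest
    already-selected point is largest.
    """
    n = len(points)
    if n == 0 or budget <= 0:
        return []
    if start is None:
        start = random.Random(seed).randrange(n)

    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    min_dist[start] = -1.0  # mark as selected so it is never picked again

    while len(selected) < min(budget, n):
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist[nxt] = -1.0
        for i, p in enumerate(points):
            if min_dist[i] < 0:
                continue  # already selected
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy data: two nearby points near the origin and three distant corners.
toy_points = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
faft_indices = farthest_first_traversal(toy_points, budget=3, start=0)
rnd_indices = random_sampling(len(toy_points), budget=3)
```

On the toy data, farthest-first traversal skips the near-duplicate at index 1 and spreads its selections across the distant corners, whereas random sampling gives every point the same chance of being drawn regardless of how the data are laid out.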