Evaluating Interactive 2D Visualization as a Sample Selection Strategy for Biomedical Time-Series Data Annotation

Reliable machine-learning models in biomedical settings depend on accurate labels, yet annotating biomedical time-series data remains challenging. Algorithmic sample selection may support annotation, but evidence from studies involving real human annotators is scarce. Consequently, we compare three sample selection methods for annotation: random sampling (RND), farthest-first traversal (FAFT), and a graphical user interface-based method enabling exploration of complementary 2D visualizations (2DVs) of high-dimensional data. We evaluated the methods across four classification tasks in infant motility assessment (IMA) and speech emotion recognition (SER). Twelve annotators, categorized as experts or non-experts, performed data annotation under a limited annotation budget, and post-annotation experiments were conducted to evaluate the sampling methods. Across all classification tasks, 2DV performed best when aggregating labels across annotators. In IMA, 2DV most effectively captured rare classes, but also exhibited greater annotator-to-annotator label distribution variability resulting from the limited annotation budget, decreasing classification performance when models were trained on individual annotators' labels; in these cases, FAFT excelled. For SER, 2DV outperformed the other methods among expert annotators and matched their performance for non-experts in the individual-annotator setting. A failure risk analysis revealed that RND was the safest choice when annotator count or annotator expertise was uncertain, whereas 2DV had the highest risk due to its greater label distribution variability. Furthermore, post-experiment interviews indicated that 2DV made the annotation task more interesting and enjoyable. Overall, 2DV-based sampling appears promising for biomedical time-series data annotation, particularly when the annotation budget is not highly constrained.
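To make the two algorithmic baselines concrete, the following is a minimal sketch of farthest-first traversal and random sampling over a feature matrix. It is an illustration under stated assumptions, not the implementation used in the study: the Euclidean distance metric, the fixed starting sample, and the names `farthest_first_traversal`, `random_sampling`, `budget`, and `start_index` are all hypothetical choices made here.

```python
import numpy as np


def farthest_first_traversal(features, budget, start_index=0):
    """Greedily pick `budget` sample indices, each maximally far from those already picked.

    Assumptions (not from the paper): Euclidean distance and a fixed start sample.
    """
    features = np.asarray(features, dtype=float)
    n_samples = features.shape[0]
    if n_samples == 0 or budget <= 0:
        return []
    budget = min(budget, n_samples)

    selected = [start_index]
    # Distance from every sample to its nearest already-selected sample.
    min_dist = np.linalg.norm(features - features[start_index], axis=1)

    while len(selected) < budget:
        # The next pick is the sample farthest from the current selection.
        next_index = int(np.argmax(min_dist))
        selected.append(next_index)
        dist_to_new = np.linalg.norm(features - features[next_index], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)

    return selected


def random_sampling(n_samples, budget, seed=0):
    """Uniformly sample `budget` distinct indices (the RND baseline)."""
    rng = np.random.default_rng(seed)
    size = min(max(budget, 0), n_samples)
    return rng.choice(n_samples, size=size, replace=False).tolist()


if __name__ == "__main__":
    # Toy 2D feature vectors standing in for high-dimensional time-series features.
    points = [[0, 0], [1, 0], [10, 0], [10, 1], [5, 5]]
    print(farthest_first_traversal(points, budget=3))
    print(random_sampling(len(points), budget=3))
```

Because each pick maximizes the distance to the nearest already-selected sample, the traversal spreads the annotation budget across the feature space, whereas random sampling tends to follow the data density.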
