This study explores training dynamics as an automated alternative to human annotation for assessing the quality of training data. We use the Data Maps framework, which classifies training examples as easy-to-learn, hard-to-learn, or ambiguous (Swayamdipta et al., 2020). Swayamdipta et al. (2020) report that hard-to-learn examples often contain annotation errors, while ambiguous examples contribute most to model training. To test the reliability of these findings, we replicate the experiments on a challenging dataset drawn from medical question answering, a domain that requires detailed medical knowledge in addition to text comprehension, further complicating the task. A comprehensive evaluation of the feasibility and transferability of Data Maps to the medical domain indicates that the framework is unsuitable for the unique challenges posed by medical question-answering datasets.
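The Data Maps categorization described above can be sketched roughly as follows: for each training example, the model's probability of the gold label is tracked across epochs, then summarized by its mean (confidence) and standard deviation (variability). The thresholds and function names below are illustrative assumptions, not values from Swayamdipta et al. (2020).

```python
# Minimal sketch of the Data Maps idea (Swayamdipta et al., 2020).
# Thresholds (0.2, 0.5) are illustrative, not taken from the paper.
import numpy as np

def categorize(gold_probs_per_epoch):
    """gold_probs_per_epoch: array of shape (n_epochs, n_examples),
    each entry the model's probability of the gold label at that epoch."""
    confidence = gold_probs_per_epoch.mean(axis=0)   # mean over epochs
    variability = gold_probs_per_epoch.std(axis=0)   # std over epochs
    labels = []
    for c, v in zip(confidence, variability):
        if v >= 0.2:            # high variability -> ambiguous
            labels.append("ambiguous")
        elif c >= 0.5:          # confident and stable -> easy-to-learn
            labels.append("easy-to-learn")
        else:                   # low confidence, stable -> hard-to-learn
            labels.append("hard-to-learn")
    return confidence, variability, labels

# Three hypothetical examples observed over three epochs.
probs = np.array([[0.90, 0.20, 0.20],
                  [0.95, 0.10, 0.80],
                  [0.92, 0.15, 0.50]])
conf, var, cats = categorize(probs)
```

In this toy run, the first example is learned confidently and stably (easy-to-learn), the second stays at low confidence (hard-to-learn, a typical signature of a labeling error), and the third oscillates across epochs (ambiguous).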