Training and deploying machine learning models rely on large amounts of human-annotated data. As human labeling becomes increasingly expensive and time-consuming, recent research has developed multiple strategies to speed up annotation and reduce costs and human workload: generating synthetic training data, active learning, and hybrid labeling. This tutorial is oriented toward practical applications: we will present the basics of each strategy, highlight their benefits and limitations, and discuss real-life case studies in detail. Additionally, we will walk through best practices for managing human annotators and controlling the quality of the final dataset. The tutorial includes a hands-on workshop, where attendees will be guided in implementing a hybrid annotation setup. This tutorial is designed for NLP practitioners from both research and industry backgrounds who are involved in or interested in optimizing data labeling projects.