Data augmentation is crucial for making machine learning models more robust and safe. However, augmenting data can be challenging because it requires generating diverse data points to rigorously evaluate model behavior on edge cases and mitigate potential harms. Creating high-quality augmentations that cover these "unknown unknowns" is a time- and creativity-intensive task. In this work, we introduce Amplio, an interactive tool to help practitioners navigate "unknown unknowns" in unstructured text datasets and improve data diversity by systematically identifying empty data spaces to explore. Amplio includes three human-in-the-loop data augmentation techniques: Augment With Concepts, Augment by Interpolation, and Augment with Large Language Model. In a user study with 18 professional red teamers, we demonstrate the utility of our augmentation methods in helping generate high-quality, diverse, and relevant model safety prompts. We find that Amplio enabled red teamers to augment data quickly and creatively, highlighting the transformative potential of interactive augmentation workflows.