In this paper, we aim to generate text classification data given arbitrary class definitions (i.e., user instructions), so that one can train a small text classifier without any human annotation or raw corpus. Compared with pioneering attempts, our proposed Incubator is the first framework that can handle complicated and even mutually dependent classes (e.g., "TED Talk given by Educator" and "Other"). Specifically, Incubator is an LLM first tuned on instruction-to-data mappings that we obtain from classification datasets and their descriptions on HuggingFace, together with in-context augmentation by GPT-4. We then refine Incubator by learning on the cluster centers of semantic textual embeddings to emphasize uniformity and semantic diversity in its generations. We compare Incubator on various classification tasks with strong baselines such as direct LLM-based inference and training data generation by prompt engineering. Experiments show that Incubator is able to (1) perform well on traditional benchmarks, (2) take label dependency and user preference into consideration, and (3) enable logical text mining by incubating multiple classifiers.
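To make the cluster-center refinement step more concrete, the following is a minimal sketch of one plausible reading of it: embed a pool of generated samples, cluster the embeddings, and keep the sample nearest to each cluster center as a semantically diverse subset. It is an illustration under stated assumptions, not the authors' implementation. The function names (`farthest_point_init`, `kmeans`, `select_diverse`, `make_toy_embeddings`), the deterministic farthest-point initialization, and the toy 2-D vectors standing in for real text embeddings are all hypothetical choices made for this example.

```python
import numpy as np


def farthest_point_init(x, k):
    """Deterministic initialization (an assumption of this sketch):
    start from sample 0, then repeatedly add the sample that is
    farthest from all centers chosen so far."""
    chosen = [0]
    min_d = ((x - x[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = int(min_d.argmax())
        chosen.append(nxt)
        min_d = np.minimum(min_d, ((x - x[nxt]) ** 2).sum(axis=1))
    return x[chosen].copy()


def kmeans(x, k, n_iter=50):
    """Plain Lloyd's k-means on the rows of x; returns the k centers."""
    centers = farthest_point_init(x, k)
    for _ in range(n_iter):
        # Squared Euclidean distances, shape (n_samples, k).
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers


def select_diverse(embeddings, k):
    """Return, for each of k cluster centers, the index of the sample
    whose embedding is closest to that center."""
    centers = kmeans(embeddings, k)
    d = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=0)


def make_toy_embeddings(seed=0):
    """Three well-separated 2-D blobs of 20 points each, standing in
    for embeddings of generated texts (hypothetical data)."""
    rng = np.random.default_rng(seed)
    means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    return np.concatenate(
        [m + 0.1 * rng.standard_normal((20, 2)) for m in means]
    )


if __name__ == "__main__":
    emb = make_toy_embeddings()
    print(select_diverse(emb, k=3))  # one representative index per cluster
```

In a real pipeline the toy vectors would be replaced by sentence embeddings of the generated samples, and the selected representatives would serve as the diverse targets used for further tuning; how exactly those targets are used is not specified in the abstract and is left open here.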