Programmatic weak supervision speeds up the labeling of large datasets by using labeling functions (LFs) that encode heuristic data sources. Writing accurate LFs, however, requires domain expertise and substantial effort. Pre-trained language models (PLMs) have recently shown strong potential across diverse tasks, but their ability to write accurate LFs autonomously remains underexplored. We address this gap with DataSculpt, an interactive framework that uses PLMs to generate LFs automatically. DataSculpt incorporates a range of prompting techniques, instance selection strategies, and LF filtration methods to explore the large design space. We evaluate DataSculpt on 12 real-world datasets spanning a range of tasks; the evaluation reveals both the strengths and the limitations of contemporary PLMs in LF design.
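To make the notion of a labeling function concrete, the following minimal sketch shows keyword-based LFs for a hypothetical binary sentiment task, combined by a simple majority vote. The keywords, function names, label constants, and the majority-vote aggregator are all illustrative assumptions; they are not taken from DataSculpt, which may generate, filter, and aggregate LFs differently.

```python
from collections import Counter

# Label convention assumed for this sketch: an LF may abstain on any input.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_contains_great(text: str) -> int:
    """Vote POSITIVE if the (hypothetical) keyword 'great' appears."""
    return POSITIVE if "great" in text.lower() else ABSTAIN


def lf_contains_terrible(text: str) -> int:
    """Vote NEGATIVE if the (hypothetical) keyword 'terrible' appears."""
    return NEGATIVE if "terrible" in text.lower() else ABSTAIN


def lf_contains_refund(text: str) -> int:
    """Vote NEGATIVE if the (hypothetical) keyword 'refund' appears."""
    return NEGATIVE if "refund" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_great, lf_contains_terrible, lf_contains_refund]


def majority_vote(text: str, lfs=LABELING_FUNCTIONS) -> int:
    """Aggregate LF votes; abstain when no LF fires or the top votes tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence with no clear winner
    return ranked[0][0]


if __name__ == "__main__":
    for sample in ["A great phone", "Terrible, I want a refund", "It arrived on Tuesday"]:
        print(sample, "->", majority_vote(sample))
```

Each LF covers only a small slice of the data and may be noisy, which is why accuracy and coverage both matter when deciding which LFs to keep.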