Data analysts have long sought to turn unstructured text data into meaningful concepts. Though common, topic modeling and clustering focus on lower-level keywords and require significant interpretative work. We introduce concept induction, a computational process that instead produces high-level concepts, defined by explicit inclusion criteria, from unstructured text. For a dataset of toxic online comments, where a state-of-the-art BERTopic model outputs "women, power, female," concept induction produces high-level concepts such as "Criticism of traditional gender roles" and "Dismissal of women's concerns." We present LLooM, a concept induction algorithm that leverages large language models to iteratively synthesize sampled text and propose human-interpretable concepts of increasing generality. We then instantiate LLooM in a mixed-initiative text analysis tool, enabling analysts to shift their attention from interpreting topics to engaging in theory-driven analysis. Through technical evaluations and four analysis scenarios ranging from literature review to content moderation, we find that LLooM's concepts improve upon the prior art of topic models in terms of quality and data coverage. In expert case studies, LLooM helped researchers to uncover new insights even from familiar datasets, for example by suggesting a previously unnoticed concept of attacks on out-party stances in a political social media dataset.