THETA: A Textual Hybrid Embedding-based Topic Analysis Framework and AI Scientist Agent for Scalable Computational Social Science

The explosion of big social data has created a scalability trap for traditional qualitative research: manual coding remains labor-intensive, and conventional topic models often suffer from semantic thinning and a lack of domain awareness. This paper introduces Textual Hybrid Embedding-based Topic Analysis (THETA), a novel computational paradigm and open-source tool designed to bridge the gap between massive data scale and rich theoretical depth. THETA moves beyond frequency-based statistics by applying Domain-Adaptive Fine-tuning (DAFT) via LoRA to foundation embedding models, which optimizes semantic vector structures within specific social contexts to capture latent meanings. To ensure epistemological rigor, we encapsulate this process in an AI Scientist Agent framework, comprising Data Steward, Modeling Analyst, and Domain Expert agents, that simulates the human-in-the-loop expert judgment and constant comparison processes central to grounded theory. Departing from purely computational models, this framework enables agents to iteratively evaluate algorithmic clusters, perform cross-topic semantic alignment, and refine raw outputs into logically consistent theoretical categories. To validate the effectiveness of THETA, we conducted experiments across six domains, including financial regulation and public health. Our results demonstrate that THETA significantly outperforms traditional models, such as LDA, ETM, and CTM, in capturing domain-specific interpretive constructs while maintaining superior coherence. By providing an interactive analysis platform, THETA democratizes advanced natural language processing for social scientists and supports the trustworthiness and reproducibility of research findings. Code is available at https://github.com/CodeSoul-co/THETA.
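To make the low-rank adaptation idea behind DAFT concrete, the following minimal sketch shows the generic LoRA update on a single weight matrix: the pretrained weight W stays frozen, and only two small factors A and B are trained, giving an effective weight W + (alpha / r) · BA. This is an illustration under stated assumptions, not THETA's implementation; the function names, the toy dimensions, the rank, and the alpha value are all hypothetical and chosen only for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_out, d_in, rank, seed=0):
    """Create LoRA factors: A (rank x d_in) is small random, B (d_out x rank) is zero.

    Zero-initialising B means the adapted layer starts out identical to the
    frozen pretrained layer.
    """
    rng = random.Random(seed)
    lora_a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    lora_b = [[0.0] * rank for _ in range(d_out)]
    return lora_a, lora_b


def lora_effective_weight(w_frozen, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A without modifying W."""
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape (d_out, d_in), rank at most r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_frozen, delta)
    ]


def trainable_params(d_out, d_in, rank):
    """Number of trainable values in A and B combined."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Hypothetical toy layer: 4 outputs, 6 inputs, rank-2 adapter.
    w = [[float(i + j) for j in range(6)] for i in range(4)]
    a, b = init_lora(d_out=4, d_in=6, rank=2)
    print(lora_effective_weight(w, a, b, alpha=16) == w)  # adapter starts as a no-op
    # Assumed 768 x 768 projection with rank 8: adapter size vs. full matrix.
    print(trainable_params(768, 768, 8), "trainable vs", 768 * 768, "frozen")
```

Because B starts at zero, fine-tuning begins from the unmodified foundation model, and the adapter holds far fewer values than the matrix it adjusts, which is what makes per-domain adaptation of a large embedding model inexpensive.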
