THETA: A Textual Hybrid Embedding-based Topic Analysis Framework and AI Scientist Agent for Scalable Computational Social Science

The explosion of big social data has created a scalability trap for traditional qualitative research: manual coding remains labor-intensive, and conventional topic models often suffer from semantic thinning and a lack of domain awareness. This paper introduces Textual Hybrid Embedding-based Topic Analysis (THETA), a novel computational paradigm and open-source tool designed to bridge the gap between massive data scale and rich theoretical depth. THETA moves beyond frequency-based statistics by applying Domain-Adaptive Fine-tuning (DAFT) via LoRA to foundation embedding models, adapting their semantic vector space to a specific social context so that it captures latent meanings. To ensure epistemological rigor, we encapsulate this process in an AI Scientist Agent framework, comprising Data Steward, Modeling Analyst, and Domain Expert agents, that simulates the human-in-the-loop expert judgment and constant comparison processes central to grounded theory. Unlike purely computational models, this framework lets agents iteratively evaluate algorithmic clusters, perform cross-topic semantic alignment, and refine raw outputs into logically consistent theoretical categories. To validate THETA, we conducted experiments across six domains, including financial regulation and public health. Our results demonstrate that THETA significantly outperforms traditional models such as LDA, ETM, and CTM in capturing domain-specific interpretive constructs while maintaining superior coherence. By providing an interactive analysis platform, THETA makes advanced natural language processing accessible to social scientists and supports the trustworthiness and reproducibility of research findings. Code is available at https://github.com/CodeSoul-co/THETA.
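To make the LoRA step concrete, the following is a minimal sketch of the general low-rank adaptation idea, not THETA's actual implementation. It assumes a single frozen linear layer with weight W and adds a trainable update scaled by alpha / r; the class name `LoRALinear`, the helper `matvec`, and the default values of `r` and `alpha` are hypothetical and chosen only for illustration.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a low-rank update (alpha / r) * B @ A.

    Illustrative only: names and defaults are assumptions, not THETA's API.
    """

    def __init__(self, W, r=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                      # pretrained weight, kept frozen
        self.scale = alpha / r          # standard LoRA scaling factor
        # A starts small and random, B starts at zero, so the adapted
        # layer initially behaves exactly like the pretrained one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)                    # W x
        delta = matvec(self.B, matvec(self.A, x))   # B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        """Number of parameters updated during fine-tuning (A and B only)."""
        r = len(self.A)
        return r * len(self.A[0]) + len(self.B) * r
```

In this sketch only A and B would be trained on domain text, so the number of updated parameters is r * (d_in + d_out) rather than d_in * d_out, which is what makes adapting a large embedding model to each domain inexpensive.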
