LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning

arXiv comment: This version introduces a major architectural shift to local LLMs and NLI-based assignment, scaling the framework to $O(1)$ generative complexity. Formerly titled 'Question-Driven Analysis and Synthesis'.

The discovery of deep, steerable taxonomies in large text corpora is currently restricted by a trade-off between the surface-level efficiency of topic models and the prohibitive, non-scalable assignment costs of LLM-integrated frameworks. We introduce \textbf{LogiPart}, a scalable, hypothesis-first framework for building interpretable hierarchical partitions that decouples hierarchy growth from expensive full-corpus LLM conditioning. LogiPart utilizes locally hosted LLMs on compact, embedding-aware samples to generate concise natural-language taxonomic predicates. These predicates are then evaluated efficiently across the entire corpus using zero-shot Natural Language Inference (NLI) combined with fast graph-based label propagation, achieving constant $O(1)$ generative token complexity per node relative to corpus size. We evaluate LogiPart across four diverse text corpora (totaling $\approx$140,000 documents). Using structured manifolds for \textbf{calibration}, we identify an empirical reasoning threshold at the 14B-parameter scale required for stable semantic grounding. On complex, high-entropy corpora (Wikipedia, US Bills), where traditional thematic metrics reveal an ``alignment gap,'' inverse logic validation confirms the stability of the induced logic, with individual taxonomic bisections maintaining an average per-node routing accuracy of up to 96\%. A qualitative audit by an independent LLM-as-a-judge confirms the discovery of meaningful functional axes, such as policy intent, that thematic ground-truth labels fail to capture. LogiPart enables frontier-level exploratory analysis on consumer-grade hardware, making hypothesis-driven taxonomic discovery feasible under realistic computational and governance constraints.
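
The assignment step described in the abstract (a predicate scored by zero-shot NLI, then spread across the corpus by graph-based label propagation) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy 2-D embeddings, the hard-coded seed scores standing in for NLI entailment probabilities, the k-nearest-neighbor cosine graph, the clamped averaging rule, and the names `knn_graph`, `propagate`, and `bisect`. The point is only that the LLM and NLI model touch a small subset of documents, while the rest are routed by cheap arithmetic over the graph.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_graph(embeddings, k):
    """For each document, list the indices of its k most similar other documents."""
    neighbors = []
    for i, u in enumerate(embeddings):
        sims = [(cosine(u, v), j) for j, v in enumerate(embeddings) if j != i]
        sims.sort(reverse=True)
        neighbors.append([j for _, j in sims[:k]])
    return neighbors

def propagate(seed_scores, neighbors, iters=20):
    """Clamped label propagation (illustrative rule, not the paper's).

    seed_scores maps a document index to a predicate score in [0, 1],
    standing in for a zero-shot NLI entailment probability. Seeded
    documents keep their score; every other document repeatedly takes
    the mean score of its graph neighbors.
    """
    n = len(neighbors)
    scores = [seed_scores.get(i, 0.5) for i in range(n)]  # 0.5 = undecided
    for _ in range(iters):
        updated = []
        for i in range(n):
            if i in seed_scores:
                updated.append(seed_scores[i])  # clamp seeds
            elif neighbors[i]:
                updated.append(sum(scores[j] for j in neighbors[i]) / len(neighbors[i]))
            else:
                updated.append(scores[i])
        scores = updated
    return scores

def bisect(scores, threshold=0.5):
    """Route each document to the 'predicate holds' side or its complement."""
    return [s >= threshold for s in scores]

# Toy corpus: two well-separated clusters in a hypothetical embedding space.
embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],   # cluster A
    [0.0, 1.0], [0.1, 0.9], [0.05, 0.95],   # cluster B
]
# Hypothetical NLI scores for one predicate, computed on two documents only.
seed_scores = {0: 0.95, 3: 0.05}

neighbors = knn_graph(embeddings, k=2)
scores = propagate(seed_scores, neighbors)
partition = bisect(scores)
print(partition)
```

In this toy setup only two of the six documents receive a model-derived score, and the remaining four inherit a side from their nearest neighbors. How LogiPart actually selects which documents to score, builds its graph, and thresholds the result is not specified in the abstract.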
