Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval

The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a "resource divide." State-of-the-art legal intelligence systems typically rely on massive parameter counts (7B+) and cloud-based inference, rendering them inaccessible to practitioners in resource-constrained environments and posing significant data sovereignty risks. This paper introduces Quecto-V1, a domain-specific Small Language Model (SLM) engineered to democratize access to Indian legal intelligence. Built upon a custom configuration of the GPT-2 architecture (124 million parameters), Quecto-V1 was trained from scratch exclusively on a corpus of Indian statutes, including the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and the Constitution of India. Unlike generalist models, which prioritize broad world knowledge, our approach maximizes "lexical density" within the legal domain. Furthermore, we address the deployment bottleneck by applying post-training 8-bit quantization (GGUF format), compressing the model to a memory footprint of under 150 MB. Our empirical analysis demonstrates that Quecto-V1 achieves high fidelity in retrieving statutory definitions and penal provisions, outperforming general-purpose SLMs in domain-specific exact match tasks while running entirely offline on consumer-grade CPUs. We further present an ablation study showing that 8-bit quantization yields a 74% reduction in model size with less than 3.5% degradation in retrieval accuracy compared to full-precision baselines. These findings suggest that for specialized, high-stakes domains like law, domain-specific training coupled with aggressive quantization offers a viable, privacy-preserving alternative to monolithic cloud models.
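The size figures quoted above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only and rests on assumptions not stated in the abstract: that the full-precision baseline stores weights as 32-bit floats, and that the 8-bit GGUF variant costs about 8.5 bits per weight (as in a Q8_0-style layout, where each block of 32 8-bit weights carries one 16-bit scale). It ignores file metadata and any tensors left unquantized, so it approximates the weight storage rather than the exact file size.

```python
def model_size_mb(n_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return n_params * bits_per_weight / 8 / 1e6


# 124 million parameters, as stated in the abstract.
N_PARAMS = 124_000_000

# Assumption: the full-precision baseline uses 32-bit floats.
fp32_mb = model_size_mb(N_PARAMS, 32.0)

# Assumption: about 8.5 bits per weight for a Q8_0-style block layout
# (32 int8 weights plus one fp16 scale per block).
q8_mb = model_size_mb(N_PARAMS, 8.5)

# Fractional reduction in weight storage relative to the baseline.
reduction = 1 - q8_mb / fp32_mb

print(f"FP32 weights:  {fp32_mb:.1f} MB")
print(f"8-bit weights: {q8_mb:.1f} MB")
print(f"Reduction:     {reduction:.1%}")
```

Under these assumptions the estimate lands below the 150 MB footprint and close to the reported 74% size reduction; the exact numbers depend on the quantization variant and on which tensors are kept at higher precision.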
