Penrose Tiled Low-Rank Compression and Section-Wise Q&A Fine-Tuning: A General Framework for Domain-Specific Large Language Model Adaptation

Large language models (LLMs) hold great promise for specialized scientific domains such as materials science, yet adapting them efficiently and accurately to domain-specific knowledge remains challenging due to limited data and high knowledge density. We propose a two-stage framework that combines structured model compression with a scientific fine-tuning regimen to address this challenge. In the compression stage, we decompose the LLM's weight matrices into local low-rank "rank blocks" and arrange these blocks in a Penrose-like non-periodic tiling pattern. Each block is then compacted via spectral transformations (e.g., discrete cosine or Fourier transforms), and a Kullback-Leibler (KL) divergence-based alignment loss preserves the distributional similarity between the compressed model's representations and those of the original full model. In the adaptation stage, the compressed model is further tuned using a human-like scientific reading protocol: it processes technical materials science documents section by section, engaging in a structured question-and-answer routine for each section. This section-wise Q&A fine-tuning strategy extracts explicit reasoning traces and gradually injects domain knowledge, while minimizing catastrophic forgetting of the model's general language capabilities. By balancing efficient compression with targeted adaptation, our two-stage approach enables precise specialization of LLMs to high-value domains under data-scarce conditions. We present this principled yet exploratory pipeline and outline its potential for advancing materials science knowledge integration, laying the groundwork for comprehensive empirical evaluation in future work.
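The compression stage described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the Penrose-like layout is constructed, so a regular square grid of blocks stands in for it, the spectral (DCT/Fourier) compaction step is omitted, and a single random matrix stands in for an LLM layer. The function names, block size, and rank are all hypothetical choices made for illustration.

```python
import numpy as np


def compress_block(block, rank):
    """Best rank-`rank` approximation of one block via truncated SVD."""
    u, s, vt = np.linalg.svd(block, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


def blockwise_low_rank(weight, block_size, rank):
    """Approximate `weight` block by block.

    ASSUMPTION: blocks lie on a regular grid; the paper arranges them in a
    Penrose-like non-periodic pattern that the abstract does not detail.
    """
    approx = np.empty_like(weight)
    rows, cols = weight.shape
    for i in range(0, rows, block_size):
        for j in range(0, cols, block_size):
            block = weight[i:i + block_size, j:j + block_size]
            approx[i:i + block_size, j:j + block_size] = compress_block(block, rank)
    return approx


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """Mean KL(p || q) over rows; `eps` guards against log(0)."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))


# Toy stand-ins for a weight matrix and a batch of activations (hypothetical sizes).
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 64))
inputs = rng.standard_normal((8, 64))

compressed = blockwise_low_rank(weight, block_size=16, rank=4)

# Alignment loss: how far the compressed layer's output distribution
# drifts from the output distribution of the original layer.
p_full = softmax(inputs @ weight)
p_comp = softmax(inputs @ compressed)
alignment_loss = kl_divergence(p_full, p_comp)
print(f"KL alignment loss: {alignment_loss:.4f}")
```

In an actual training setup the KL term would presumably be minimized with respect to the compressed factors rather than only measured, and the direction of the divergence (full-to-compressed versus the reverse) is a design choice the abstract leaves open.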

