Fine-tuning a 7B language model for specialized advising is attractive in resource-constrained settings, but multi-epoch runs routinely exceed the wall-clock limits of the free-tier GPUs (Kaggle, Colab) such users rely on. We report two things. First, a practical recipe: a three-epoch QLoRA fine-tune of Mistral-7B-Instruct-v0.3 (4-bit NF4, LoRA rank 16, via Unsloth) completed across two free-tier 16 GB GPUs (Tesla P100 then T4) by checkpointing only the small LoRA adapter (41.9M parameters) and resuming on the second machine. Adapter-only handoff is sufficient -- optimizer and scheduler state need not be transferred -- so the binding constraint is per-step VRAM and per-session wall-clock, not aggregate compute. Second, and more importantly, an honest evaluation that returns a cautionary result. On a blind held-out comparison against the un-fine-tuned base model, the fine-tuned model scored higher on similarity to the synthetic training distribution (BERTScore F1 +0.063, a fidelity not quality signal) but lower on advising quality: a blind LLM-as-judge preferred the base model on 46% of prompts versus 18%, and a source-verified factuality audit found four confident errors from the fine-tuned model on policy-sensitive topics against zero for the base. Auditing the training data with the same method, we find this is not a fine-tuning artifact: each audited error is already present in the Gemini-generated training answers, and a random-sample audit finds verifiable errors in a sizable fraction of responses (28-40%; single-judge, n=40). The data is therefore sufficient to account for the errors, which we attribute to the synthetic-data pipeline rather than the adapter-handoff method. We release the dataset, adapter, cross-GPU notebooks, and full evaluation harness so every result reproduces on a single 16 GB GPU.
翻译:在资源受限场景下,针对特定领域微调7B语言模型构建顾问系统颇具吸引力,但多轮次运行常超出用户依赖的免费GPU(Kaggle、Colab)计算时长限制。本文报告两则发现。其一,提出实用方案:通过仅检查点小型LoRA适配器(4190万参数)并在第二台机器恢复训练,在两张免费16GB GPU(Tesla P100与T4)上完成Mistral-7B-Instruct-v0.3的三轮次QLoRA微调(4-bit NF4量化,LoRA秩16,基于Unsloth框架)。仅适配器交接即可生效——无需转移优化器与调度器状态,因此关键约束在于单步VRAM消耗与会话时长,而非整体计算量。其二,更重要的警示性诚实评估结果。在盲测对比中,微调模型与未微调基模相比,在拟合合成训练数据分布方面得分更高(BERTScore F1 +0.063,属保真度而非质量信号),但顾问质量表现更差:盲测LLM裁判在46%提示词上偏好基模(微调模型仅18%),且来源验证的事实性审计发现微调模型在政策敏感话题上出现四次确定错误,而基模零错误。使用相同方法审计训练数据,发现此现象非微调产物:每个审计错误均存在于Gemini生成的训练答案中,随机抽样审计显示相当比例响应(28-40%;单一裁判,n=40)存在可验证错误。因此数据足以解释错误成因,我们将此归因于合成数据管线而非适配器交接方法。我们已公开数据集、适配器、跨GPU笔记本及完整评估框架,确保所有结果可在单张16GB GPU上复现。