Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application

On-device Small Language Models (SLMs) promise fully offline, private AI experiences for mobile users (no cloud dependency, no data leaving the device). But is this promise achievable in practice? This paper presents a longitudinal practitioner case study documenting the engineering challenges of integrating SLMs (Gemma 4 E2B, 2.6B parameters; Qwen3 0.6B, 600M parameters) into Palabrita, a production Android word-guessing game. Over a 5-day development sprint comprising 204 commits (~90 directly AI-related), the system underwent a radical transformation: from an ambitious design where the LLM generated complete structured puzzles (word, category, difficulty, and five hints as JSON) to a pragmatic architecture where curated word lists provide the words and the LLM generates only three short hints, with a deterministic fallback if it fails. We identify five categories of failures specific to on-device SLM integration: output format violations, constraint violations, context quality degradation, latency incompatibility, and model selection instability. For each failure category, we document the observed symptoms, root causes, and the prompt engineering and architectural strategies that effectively mitigated them, including multi-layer defensive parsing, contextual retry with failure feedback, session rotation, progressive prompt hardening, and systematic responsibility reduction. Our findings demonstrate that on-device SLMs are viable for production mobile applications, but only when the developer accepts a fundamental constraint: the most reliable on-device LLM feature is one where the LLM does the least. We distill our experience into eight actionable design heuristics for practitioners integrating SLMs into mobile apps.
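The final architecture described above (curated word in, three short hints out, defensive parsing, retry with failure feedback, deterministic fallback) can be illustrated with a minimal sketch. It is not Palabrita's implementation: Python is used here only for brevity, and the names `parse_hints`, `generate_hints`, `fallback_hints`, the prompt wording, the retry limit, and the specific validation rules are all assumptions made for illustration. The `model` argument stands in for whatever on-device inference call the app actually uses.

```python
import json
import re
from typing import Callable, List

HINT_COUNT = 3  # the final design asks the model for three short hints


def parse_hints(raw: str, word: str) -> List[str]:
    """Hypothetical multi-layer parser; raises ValueError with a reason on failure."""
    candidates: List[str] = []
    # Layer 1: extract a JSON array from the reply, ignoring any chatter around it.
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            if isinstance(data, list):
                candidates = [str(item) for item in data]
        except json.JSONDecodeError:
            pass  # fall through to the line-based layer
    # Layer 2: treat non-empty lines as hints, stripping bullets and numbering.
    if not candidates:
        for line in raw.splitlines():
            cleaned = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
            if cleaned:
                candidates.append(cleaned)
    # Layer 3: enforce constraints (exact count, answer must not leak).
    hints = [h.strip() for h in candidates if h.strip()]
    if len(hints) != HINT_COUNT:
        raise ValueError(f"expected {HINT_COUNT} hints, got {len(hints)}")
    if any(word.lower() in h.lower() for h in hints):
        raise ValueError("a hint contained the secret word")
    return hints


def fallback_hints(word: str) -> List[str]:
    """Deterministic hints that need no model at all (assumed fallback content)."""
    return [
        f"The word has {len(word)} letters.",
        f"It starts with '{word[0].upper()}'.",
        f"It ends with '{word[-1].upper()}'.",
    ]


def generate_hints(
    word: str, model: Callable[[str], str], max_attempts: int = 3
) -> List[str]:
    """Ask the model for hints, retrying with failure feedback, then fall back."""
    prompt = (
        f"Give exactly {HINT_COUNT} short hints for the word '{word}' "
        "as a JSON array of strings. Do not use the word itself."
    )
    feedback = ""
    for _ in range(max_attempts):
        try:
            return parse_hints(model(prompt + feedback), word)
        except ValueError as reason:
            # Contextual retry: tell the model why the last reply was rejected.
            feedback = f"\nYour previous reply was rejected: {reason}. Try again."
    return fallback_hints(word)
```

Because the word comes from a curated list and the fallback is deterministic, the caller always receives exactly three hints, whatever the model returns. This is the sense in which the model "does the least": it can improve the hints but cannot break the puzzle.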
