I recently shipped an AI assistant in my Curtain Estimator app that runs entirely on-device. No API calls, no cloud dependencies, complete privacy. Users can create jobs, search customers, and manage projects using natural language—all while completely offline.
This post walks through how I built it using llama.rn (React Native bindings for llama.cpp) and Qwen 1.7B, a surprisingly capable small language model.
## The Privacy Problem with Cloud AI
Most AI features in mobile apps work like this:
- User types a message
- App sends it to OpenAI/Anthropic/Google
- Response comes back
- Bill accumulates
For a curtain installation business app, this means:
- Customer names, addresses, phone numbers → sent to third parties
- Project details, pricing, notes → stored on external servers
- Compliance headaches (GDPR, data residency requirements)
- Monthly API bills scaling with usage
The alternative: run the AI model directly on the user’s phone.
## Why It Actually Works Now
A year ago, this would’ve been impractical. But three things changed:
1. **Quantized models are tiny.** Qwen 1.7B with Q4_K_M quantization is just 1.1 GB—smaller than most games. It downloads once and lives in the app's storage.
2. **Mobile GPUs are fast.** llama.cpp leverages Metal (iOS) and Vulkan (Android) for GPU acceleration. On an iPhone 14 Pro, I get ~15 tokens/second—fast enough for real-time streaming.
3. **Small models got smart.** Qwen 1.7B can follow structured instructions, parse JSON, and chain multi-step reasoning. Perfect for business logic, not creative writing.
## Choosing the Right Model
I tested four small models before settling on Qwen3-1.7B:
**TinyLlama 1.1B (637 MB):** Super fast, but struggles with structured output. It would frequently hallucinate customer IDs or forget required fields.

**Phi-3-mini (1.8 GB):** Strong reasoning, but way too verbose. Simple queries would generate 200+ word responses when I needed 20.

**Gemma-2B (1.2 GB):** Fast and accurate for classification, but weak at function calling. It couldn't reliably output the `<action>` tags my tool system needed.

**Qwen3-1.7B (1.1 GB)** ✅ Hit the sweet spot: reliable structured output, follows instructions precisely, and supports chain-of-thought reasoning with `<think>` tags.
The Q4_K_M quantization uses 4-bit weights with k-means clustering—75% smaller than full precision with only ~5% quality loss.
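As a back-of-envelope check (illustrative only—the ~4.5 bits per weight for Q4_K_M is an approximation, and the real file also holds metadata and a few higher-precision tensors):

```typescript
// Rough size estimate for a 1.7B-parameter model (illustrative, not exact).
const params = 1.7e9;
const fp16GB = (params * 2) / 1e9;     // fp16: 2 bytes per weight
const q4GB = (params * 4.5) / 8 / 1e9; // Q4_K_M: ~4.5 bits per weight on average
console.log(fp16GB.toFixed(1), q4GB.toFixed(2)); // 3.4 0.96
```

That lines up with the ~1.1 GB file size once headers and embedding tables are included.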
## Integration: llama.rn Setup

The llama.rn library wraps llama.cpp for React Native. Installation is straightforward:

```bash
npm install llama.rn
cd ios && pod install
```

Then configure the model download:
```typescript
import * as FileSystem from "expo-file-system";

const MODEL_URL =
  "https://huggingface.co/unsloth/Qwen3-1.7B-GGUF/resolve/main/Qwen3-1.7B-Q4_K_M.gguf";
const MODEL_PATH =
  FileSystem.documentDirectory + "llama-models/Qwen3-1.7B-Q4_K_M.gguf";

const downloadModel = async () => {
  const downloadResumable = FileSystem.createDownloadResumable(
    MODEL_URL,
    MODEL_PATH,
    {},
    (progress) => {
      const pct =
        progress.totalBytesWritten / progress.totalBytesExpectedToWrite;
      setDownloadProgress(pct);
    }
  );
  await downloadResumable.downloadAsync();
};
```

The model downloads on first use, with progress tracking. On WiFi, it takes 2-3 minutes. On LTE, maybe 5-8 minutes.
Once downloaded, load it with GPU acceleration:

```typescript
import { initLlama } from "llama.rn";

const ctx = await initLlama({
  model: MODEL_PATH,
  n_ctx: 8192,      // 8K context window
  n_gpu_layers: 99, // Use GPU for all layers
});
```

That `n_gpu_layers: 99` is critical—it gives ~5x speedup by offloading computation to Metal/Vulkan instead of the CPU.
## Streaming Inference

Users expect real-time responses, not a loading spinner for 10 seconds. llama.rn supports token-by-token streaming:

```typescript
let fullResponse = "";

await llamaContext.completion(
  {
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: "Create a job for John Smith" },
    ],
    n_predict: 512,
    temperature: 0.7,
    top_p: 0.8,
  },
  (data) => {
    // Called for each token
    fullResponse += data.token;
    setStreamingText(fullResponse);
  }
);
```

The UI updates in real time as tokens stream in. On modern phones:

- First token: ~200 ms (prompt processing)
- Subsequent tokens: ~50-80 ms each

It feels instant compared to network-bound APIs.
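Plugging those figures into a quick estimate (a sketch—real numbers vary by device, prompt length, and reply length):

```typescript
// Approximate wall-clock time to stream an n-token reply,
// using the per-token latencies measured above.
function estimateLatencyMs(
  nTokens: number,
  firstTokenMs = 200,
  perTokenMs = 65
): number {
  return firstTokenMs + nTokens * perTokenMs;
}

console.log(estimateLatencyMs(60)); // 4100 -> a typical 60-token reply in ~4 s
```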
## Function Calling for Small Models
GPT-4’s function calling uses JSON schemas. Small models struggle with this—they’ll output malformed JSON or miss required fields.
Instead, I use a simpler XML-based protocol:
```typescript
const systemPrompt = `You are an AI assistant. When you need to take an action, output:

<action>{"type":"search_customers","query":"Smith"}</action>

Available actions:
- search_customers: {"type":"search_customers","query":"John"}
- create_job: {"type":"create_job","customer_id":5,"alias":"Living Room"}
- update_job: {"type":"update_job","job_id":"J-0042","status":"quoting"}

Rules:
1. ONE action tag per response, at the very end
2. When searching/creating → end with <action>
3. When just chatting → no action tag
`;
```

Why XML tags instead of pure JSON?
- **Clear delimiters:** `<action>` and `</action>` are unambiguous start/end markers
- **Easy parsing:** Simple regex, no JSON schema validation
- **Self-documenting:** Examples show the exact format
- **Forgiving:** Works even if the model adds extra text
When the model outputs an action, I parse it:

```typescript
function parseAction(text: string): Action | null {
  const match = text.match(/<action>([\s\S]*?)<\/action>/);
  if (!match) return null;
  try {
    return JSON.parse(match[1].trim());
  } catch {
    return null;
  }
}
```

Then execute it and inject the result back into the conversation:
```typescript
const action = parseAction(modelResponse);
if (action) {
  const result = await executeAction(action);
  // Inject the result back as a user-role message
  conversationHistory.push({
    role: "user",
    content: `[TOOL_RESULT] ${result.message}\n\n→ NEXT: Tell the user what happened.`,
  });
  // Continue generation
  await generateNextResponse();
}
```

This enables multi-step workflows. For example:
User: "Create a job for Smith"

1. Model: `<action>{"type":"search_customers","query":"Smith"}</action>`
2. System: `[TOOL_RESULT] Found: Jane Smith (id:42), Bob Smith (id:89)`
3. Model: "I found two Smiths—Jane and Bob. Which one?"
4. User: "Jane"
5. Model: `<action>{"type":"create_job","customer_id":42}</action>`
6. System: `[TOOL_RESULT] Job J-0073 created`
7. Model: "Done! Created job J-0073 for Jane Smith."
## Chain-of-Thought with `<think>` Tags
Small models benefit from explicit reasoning steps. Qwen supports thinking mode:
```typescript
const systemPrompt = `Before each response, wrap your reasoning in <think> tags:

<think>
INTENT: what does the user want?
HAVE: what data do I already have?
NEED: what's still missing?
DECISION: call tool | ask user | just respond
</think>

Then output your actual response.`;
```

The model's internal reasoning looks like:
```
<think>
INTENT: create a job
HAVE: customer query "Smith"
NEED: exact customer_id
DECISION: search first
</think>

<action>{"type":"search_customers","query":"Smith"}</action>
```

I strip the `<think>` tags from the UI, but show them during development for debugging. It dramatically improves reliability—the model talks itself through the logic.
## Handling Conversations and Context
The 8K context window fills up fast with tool results. After ~30 turns, I hit the limit.
Solution: conversation compaction. I ask the model to summarize:
```typescript
const compactPrompt = `Summarize this conversation in under 200 words.
Include: user's goal, customers/jobs created (with IDs), and any unfinished tasks.

${conversationHistory.map(m => `${m.role}: ${m.content}`).join('\n\n')}`;

const summary = await chat([{ role: "user", content: compactPrompt }]);

// Replace history with summary
conversationHistory = [
  { role: "user", content: `[CONTEXT SUMMARY]\n${summary}\n[END SUMMARY]` }
];
```

A 50-message conversation compresses to ~150 tokens. The model can continue from the summary without losing track of created jobs or customer IDs.
## Session Persistence
Chat sessions are saved to my Django backend:
```typescript
interface SessionMessage {
  role: "user" | "assistant";
  content: string; // Stripped for display
  raw: string;     // Full output with <think> and <action> tags
}

await api.createAIChatSession({
  organization_id: orgId,
  title: firstUserMessage.slice(0, 60),
  messages: [
    { role: "user", content: userText, raw: userText },
    { role: "assistant", content: cleanedResponse, raw: fullModelOutput },
  ],
});
```

The dual storage (`content` vs `raw`) enables:
- UI replay: Show clean messages in chat history
- Exact restoration: Resume with full context including actions
Users can switch between conversations via a history dropdown, similar to ChatGPT.
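The display `content` can be derived from `raw` with a small stripping helper (a sketch of one way to do it—`cleanForDisplay` is a hypothetical name, not the app's actual function):

```typescript
// Strip <think> and <action> blocks so only user-facing text remains.
function cleanForDisplay(raw: string): string {
  return raw
    .replace(/<think>[\s\S]*?<\/think>/g, "")
    .replace(/<action>[\s\S]*?<\/action>/g, "")
    .trim();
}
```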
## Performance: Benchmarks from Production

| Device | Model load | First token | Streaming | Memory |
|---|---|---|---|---|
| iPhone 14 Pro (A16 Bionic, 6 GB RAM) | ~3 s | 180-220 ms | 14-16 tok/s | ~1.8 GB |
| Samsung Galaxy S23 (Snapdragon 8 Gen 2, 8 GB RAM) | ~4 s | 250-300 ms | 10-12 tok/s | ~2.1 GB |
| iPhone 11 (A13, 4 GB RAM) | ~6 s | 400-500 ms | 6-8 tok/s | ~2.2 GB |

The iPhone 11 occasionally crashes under low-memory conditions.
Recommendation: 6GB+ RAM for smooth experience. Works on 4GB but may need to unload the model when the app backgrounds.
## Memory Management
iOS will kill your app if it uses too much memory. Unload the model when the app is backgrounded:

```typescript
useEffect(() => {
  const subscription = AppState.addEventListener("change", (nextAppState) => {
    if (nextAppState === "background") {
      llamaContext?.release(); // Free ~2 GB
    } else if (nextAppState === "active") {
      loadModel(); // Reload when foregrounded
    }
  });
  return () => subscription.remove();
}, []);
```

This prevents memory warnings and keeps your app responsive.
## Battery Impact
Running inference on-device uses power. Real-world testing:
| Usage Pattern | Extra Battery Drain |
|---|---|
| Light (5-10 queries/day) | <1% per day |
| Medium (20-30 queries/day) | ~3-4% per day |
| Heavy (50+ queries/day) | ~6-8% per day |
GPU acceleration is faster but uses more power than CPU. The trade-off is worth it—users prefer instant responses.
## Real-World Results
After two months in production with ~200 beta users:
Most common use cases:
- “Create a job for [customer name]” - 68% of queries
- “Find jobs for [customer]” - 15%
- “Update job [ID] to [status]” - 9%
- General questions about the app - 8%
Success rate: 91% of queries complete successfully on first try
Failure modes:
- Customer not found (user misspelled name) - 4%
- Context overflow (very long conversations) - 3%
- Hallucinated data (model made up customer IDs) - 2%
The hallucination issue improved dramatically after adding structured tool results with explicit `customer_id=42` fields. Small models need this scaffolding.
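The structured results look something like this (an illustrative shape—`formatSearchResult` and the `Customer` type are hypothetical names, not the app's actual code):

```typescript
type Customer = { id: number; name: string };

// Spell out IDs explicitly so the model copies them instead of inventing them.
function formatSearchResult(customers: Customer[]): string {
  if (customers.length === 0) return "[TOOL_RESULT] No customers found";
  const rows = customers.map((c) => `${c.name} (customer_id=${c.id})`);
  return `[TOOL_RESULT] Found: ${rows.join(", ")}`;
}
```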
## Cost Comparison
On-device (current):
- Infrastructure: $0/month
- Per-user cost: $0 (one-time 1.1 GB download)
- Scales to: unlimited users
Cloud alternative (estimated):
- Average conversation: ~2,000 tokens
- 200 users × 30 conversations/month = 6,000 conversations
- 6,000 × 2,000 = 12M tokens/month
- OpenAI GPT-3.5: $18/month
- OpenAI GPT-4: $360/month
- Anthropic Claude: $180/month
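The token math behind those numbers (the per-model prices above are the post's estimates at the time of writing):

```typescript
const users = 200;
const conversationsPerUser = 30;
const tokensPerConversation = 2000;

const conversations = users * conversationsPerUser;          // 6,000/month
const monthlyTokens = conversations * tokensPerConversation; // 12,000,000/month
console.log(conversations, monthlyTokens / 1e6); // 6000 12
```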
The on-device approach paid for itself immediately. Even at scale (10,000 users), the cost stays at $0.
## Limitations to Know
**1. Model capabilities.** Qwen 1.7B is great at structured tasks, terrible at:
- Complex reasoning (multi-hop questions)
- Factual knowledge (it’s not a search engine)
- Creative content generation
- Nuanced language understanding
Design your features around what small models do well.
**2. Device requirements.**

- Minimum: 3 GB RAM, 2 GB free storage
- Recommended: 6 GB RAM, 64-bit processor, GPU support

Older devices (iPhone 8, Galaxy S9) struggle. Consider feature detection and graceful fallback.
**3. Model updates.** To update the model, users need to re-download 1.1 GB. I versioned the model in the storage path:

```typescript
const MODEL_PATH = `${FileSystem.documentDirectory}llama-models/v2/Qwen3-1.7B-Q4_K_M.gguf`;
```

When shipping a new model version, increment the path. Old models auto-delete on uninstall.
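Old versions can also be cleaned up on upgrade rather than waiting for uninstall (a sketch—`staleModelDirs` is a hypothetical helper, and the actual deletion would go through expo-file-system):

```typescript
// Given the entries of the llama-models directory, pick old version folders
// to delete once the new model has downloaded successfully.
function staleModelDirs(entries: string[], currentVersion = "v2"): string[] {
  return entries.filter((name) => name !== currentVersion);
}
```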
## Platform Differences: iOS vs Android

**iOS (Metal):**

- Faster inference (~20% faster than Android)
- Better memory management
- Model backed up to iCloud by default (can disable via `NSURLIsExcludedFromBackupKey`)

**Android (Vulkan/OpenCL):**

- More variability across devices
- Some older GPUs don't support Vulkan (fallback to CPU)
- Storage not backed up automatically
Test on both platforms—performance can vary significantly.
## Debugging Tips

**1. Enable verbose logging**

```typescript
const ctx = await initLlama({
  model: MODEL_PATH,
  n_ctx: 8192,
  n_gpu_layers: 99,
  verbose: true, // Logs every token + timings
});
```

**2. Track token counts**

```typescript
const tokens = fullResponse.split(/\s+/).length * 1.3; // Rough estimate
console.log(`Generated ${tokens} tokens in ${duration}ms`);
```

**3. Monitor memory**

```typescript
import DeviceInfo from "react-native-device-info";

const memoryUsage = await DeviceInfo.getUsedMemory();
console.log(`Memory: ${(memoryUsage / 1024 / 1024).toFixed(0)} MB`);
```

**4. Test offline**
Enable airplane mode and verify everything works. The model should load from cache, inference should complete, and only backend persistence should fail gracefully.
## What's Next

**Multimodal support.** llama.cpp supports vision models (LLaVA, Qwen2-VL). Imagine: "Here's a photo of the curtain fabric—add it to the job."

**Fine-tuning.** Train a LoRA adapter on domain-specific data (past job descriptions, customer interactions). Could improve accuracy by 20-30%.

**Voice interface.** Combine with Whisper.cpp for on-device speech recognition. Fully offline voice assistant.

**Federated learning.** Aggregate model improvements across users without sharing raw data. Privacy-preserving personalization.
## Should You Build This?
Good fit:
- Business apps with sensitive data
- Offline-first workflows
- Structured tasks (classification, data entry, search)
- Cost-sensitive at scale
Not a good fit:
- Open-ended chat (use Claude/GPT-4)
- Knowledge-intensive tasks (small models have limited training data)
- Low-end device support required
- Rapidly changing domain knowledge
## Wrapping Up
On-device AI gives you instant responses, offline functionality, and zero marginal cost. The privacy angle is a bonus.
The tech stack:
- llama.rn: React Native bindings for llama.cpp
- Qwen 1.7B Q4_K_M: Compact, instruction-following model
- Custom tool system: XML-based function calling
- Stream-first architecture: Token-by-token UI updates
Total implementation: ~12 hours of development, $0/month infrastructure cost, works offline, complete privacy.
If you’re building a business app with sensitive data, this is a practical option now. The tooling works, the performance is there, and the cost is hard to beat.
Questions? Feedback? I’m @jaredlynskey. The llama.rn library is at github.com/mybigday/llama.rn.

