I recently shipped an AI assistant in my Curtain Estimator app that runs entirely on-device. No API calls, no cloud dependencies, complete privacy. Users can create jobs, search customers, and manage projects using natural language—all while completely offline.
This post walks through how I built it using llama.rn (React Native bindings for llama.cpp) and Qwen 1.7B, a surprisingly capable small language model.
## The Privacy Problem with Cloud AI
Most AI features in mobile apps work like this:
- User types a message
- App sends it to OpenAI/Anthropic/Google
- Response comes back
- Bill accumulates
For a curtain installation business app, this means:
- Customer names, addresses, phone numbers → sent to third parties
- Project details, pricing, notes → stored on external servers
- Compliance headaches (GDPR, data residency requirements)
- Monthly API bills scaling with usage
The alternative: run the AI model directly on the user’s phone.
## Why It Actually Works Now
A year ago, this would’ve been impractical. But three things changed:
1. **Quantized models are tiny.** Qwen 1.7B with Q4_K_M quantization is just 1.1 GB—smaller than most games. It downloads once and lives in the app's storage.
2. **Mobile GPUs are fast.** llama.cpp leverages Metal (iOS) and Vulkan (Android) for GPU acceleration. On an iPhone 14 Pro, I get ~15 tokens/second—fast enough for real-time streaming.
3. **Small models got smart.** Qwen 1.7B can follow structured instructions, parse JSON, and chain multi-step reasoning. Perfect for business logic, not creative writing.
## Choosing the Right Model
I tested four small models before settling on Qwen3-1.7B:
**TinyLlama 1.1B (637 MB):** Super fast, but struggles with structured output. It would frequently hallucinate customer IDs or forget required fields.

**Phi-3-mini (1.8 GB):** Strong reasoning, but way too verbose. Simple queries would generate 200+ word responses when I needed 20.

**Gemma-2B (1.2 GB):** Fast and accurate for classification, but weak at function calling. It couldn't reliably output the `<action>` tags my tool system needed.

**Qwen3-1.7B (1.1 GB)** ✅ Hit the sweet spot: reliable structured output, follows instructions precisely, and supports chain-of-thought reasoning with `<think>` tags.
The Q4_K_M quantization uses 4-bit weights with k-means clustering—75% smaller than full precision with only ~5% quality loss.
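As a back-of-envelope check (illustrative only—the ~4.5 bits per weight for Q4_K_M is an approximation, and the real file also holds metadata and a few higher-precision tensors):

```typescript
// Rough size estimate for a 1.7B-parameter model (illustrative, not exact).
const params = 1.7e9;
const fp16GB = (params * 2) / 1e9;     // fp16: 2 bytes per weight
const q4GB = (params * 4.5) / 8 / 1e9; // Q4_K_M: ~4.5 bits per weight on average
console.log(fp16GB.toFixed(1), q4GB.toFixed(2)); // 3.4 0.96
```

That lines up with the ~1.1 GB file size once headers and embedding tables are included.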
## Integration: llama.rn Setup

The llama.rn library wraps llama.cpp for React Native. Installation is straightforward:

```bash
npm install llama.rn
cd ios && pod install
```

Then configure the model download:
```typescript
import * as FileSystem from "expo-file-system";

const MODEL_URL =
  "https://huggingface.co/unsloth/Qwen3-1.7B-GGUF/resolve/main/Qwen3-1.7B-Q4_K_M.gguf";
const MODEL_PATH =
  FileSystem.documentDirectory + "llama-models/Qwen3-1.7B-Q4_K_M.gguf";

const downloadModel = async () => {
  const downloadResumable = FileSystem.createDownloadResumable(
    MODEL_URL,
    MODEL_PATH,
    {},
    (progress) => {
      const pct =
        progress.totalBytesWritten / progress.totalBytesExpectedToWrite;
      setDownloadProgress(pct);
    }
  );
  await downloadResumable.downloadAsync();
};
```

The model downloads on first use, with progress tracking. On WiFi, it takes 2-3 minutes. On LTE, maybe 5-8 minutes.
Once downloaded, load it with GPU acceleration:

```typescript
import { initLlama } from "llama.rn";

const ctx = await initLlama({
  model: MODEL_PATH,
  n_ctx: 8192,      // 8K context window
  n_gpu_layers: 99, // Use GPU for all layers
});
```

That `n_gpu_layers: 99` is critical—it gives ~5x speedup by offloading computation to Metal/Vulkan instead of the CPU.
## Streaming Inference

Users expect real-time responses, not a loading spinner for 10 seconds. llama.rn supports token-by-token streaming:

```typescript
let fullResponse = "";

await llamaContext.completion(
  {
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: "Create a job for John Smith" },
    ],
    n_predict: 512,
    temperature: 0.7,
    top_p: 0.8,
  },
  (data) => {
    // Called for each token
    fullResponse += data.token;
    setStreamingText(fullResponse);
  }
);
```

The UI updates in real time as tokens stream in. On modern phones:

- First token: ~200 ms (prompt processing)
- Subsequent tokens: ~50-80 ms each

It feels instant compared to network-bound APIs.
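Plugging those figures into a quick estimate (a sketch—real numbers vary by device, prompt length, and reply length):

```typescript
// Approximate wall-clock time to stream an n-token reply,
// using the per-token latencies measured above.
function estimateLatencyMs(
  nTokens: number,
  firstTokenMs = 200,
  perTokenMs = 65
): number {
  return firstTokenMs + nTokens * perTokenMs;
}

console.log(estimateLatencyMs(60)); // 4100 -> a typical 60-token reply in ~4 s
```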
## Function Calling for Small Models
GPT-4’s function calling uses JSON schemas. Small models struggle with this—they’ll output malformed JSON or miss required fields.
Instead, I use a simpler XML-based protocol:
```typescript
const systemPrompt = `You are an AI assistant. When you need to take an action, output:

<action>{"type":"search_customers","query":"Smith"}</action>

Available actions:
- search_customers: {"type":"search_customers","query":"John"}
- create_job: {"type":"create_job","customer_id":5,"alias":"Living Room"}
- update_job: {"type":"update_job","job_id":"J-0042","status":"quoting"}

Rules:
1. ONE action tag per response, at the very end
2. When searching/creating → end with <action>
3. When just chatting → no action tag
`;
```

Why XML tags instead of pure JSON?
- **Clear delimiters:** `<action>` and `</action>` are unambiguous start/end markers
- **Easy parsing:** Simple regex, no JSON schema validation
- **Self-documenting:** Examples show the exact format
- **Forgiving:** Works even if the model adds extra text
When the model outputs an action, I parse it:

```typescript
function parseAction(text: string): Action | null {
  const match = text.match(/<action>([\s\S]*?)<\/action>/);
  if (!match) return null;
  try {
    return JSON.parse(match[1].trim());
  } catch {
    return null;
  }
}
```

Then execute it and inject the result back into the conversation:
```typescript
const action = parseAction(modelResponse);
if (action) {
  const result = await executeAction(action);
  // Inject the result back as a user-role message
  conversationHistory.push({
    role: "user",
    content: `[TOOL_RESULT] ${result.message}\n\n→ NEXT: Tell the user what happened.`,
  });
  // Continue generation
  await generateNextResponse();
}
```

This enables multi-step workflows. For example:
User: "Create a job for Smith"

1. Model: `<action>{"type":"search_customers","query":"Smith"}</action>`
2. System: `[TOOL_RESULT] Found: Jane Smith (id:42), Bob Smith (id:89)`
3. Model: "I found two Smiths—Jane and Bob. Which one?"
4. User: "Jane"
5. Model: `<action>{"type":"create_job","customer_id":42}</action>`
6. System: `[TOOL_RESULT] Job J-0073 created`
7. Model: "Done! Created job J-0073 for Jane Smith."
## Chain-of-Thought with `<think>` Tags
Small models benefit from explicit reasoning steps. Qwen supports thinking mode:
```typescript
const systemPrompt = `Before each response, wrap your reasoning in <think> tags:

<think>
INTENT: what does the user want?
HAVE: what data do I already have?
NEED: what's still missing?
DECISION: call tool | ask user | just respond
</think>

Then output your actual response.`;
```

The model's internal reasoning looks like:
```
<think>
INTENT: create a job
HAVE: customer query "Smith"
NEED: exact customer_id
DECISION: search first
</think>

<action>{"type":"search_customers","query":"Smith"}</action>
```

I strip the `<think>` tags from the UI, but show them during development for debugging. It dramatically improves reliability—the model talks itself through the logic.
## Handling Conversations and Context
The 8K context window fills up fast with tool results. After ~30 turns, I hit the limit.
Solution: conversation compaction. I ask the model to summarize:
```typescript
const compactPrompt = `Summarize this conversation in under 200 words.
Include: user's goal, customers/jobs created (with IDs), and any unfinished tasks.

${conversationHistory.map(m => `${m.role}: ${m.content}`).join('\n\n')}`;

const summary = await chat([{ role: "user", content: compactPrompt }]);

// Replace history with summary
conversationHistory = [
  { role: "user", content: `[CONTEXT SUMMARY]\n${summary}\n[END SUMMARY]` }
];
```

A 50-message conversation compresses to ~150 tokens. The model can continue from the summary without losing track of created jobs or customer IDs.
## Session Persistence
Chat sessions are saved to my Django backend:
```typescript
interface SessionMessage {
  role: "user" | "assistant";
  content: string; // Stripped for display
  raw: string;     // Full output with <think> and <action> tags
}

await api.createAIChatSession({
  organization_id: orgId,
  title: firstUserMessage.slice(0, 60),
  messages: [
    { role: "user", content: userText, raw: userText },
    { role: "assistant", content: cleanedResponse, raw: fullModelOutput },
  ],
});
```

The dual storage (`content` vs `raw`) enables:
- UI replay: Show clean messages in chat history
- Exact restoration: Resume with full context including actions
Users can switch between conversations via a history dropdown, similar to ChatGPT.
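The display `content` can be derived from `raw` with a small stripping helper (a sketch of one way to do it—`cleanForDisplay` is a hypothetical name, not the app's actual function):

```typescript
// Strip <think> and <action> blocks so only user-facing text remains.
function cleanForDisplay(raw: string): string {
  return raw
    .replace(/<think>[\s\S]*?<\/think>/g, "")
    .replace(/<action>[\s\S]*?<\/action>/g, "")
    .trim();
}
```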
## Performance: Benchmarks from Production

| Device | Model load | First token | Streaming | Memory |
|---|---|---|---|---|
| iPhone 14 Pro (A16 Bionic, 6 GB RAM) | ~3 s | 180-220 ms | 14-16 tok/s | ~1.8 GB |
| Samsung Galaxy S23 (Snapdragon 8 Gen 2, 8 GB RAM) | ~4 s | 250-300 ms | 10-12 tok/s | ~2.1 GB |
| iPhone 11 (A13, 4 GB RAM) | ~6 s | 400-500 ms | 6-8 tok/s | ~2.2 GB |

The iPhone 11 occasionally crashes under low-memory conditions.
Recommendation: 6GB+ RAM for smooth experience. Works on 4GB but may need to unload the model when the app backgrounds.
## Memory Management
iOS will kill your app if it uses too much memory. Unload the model when the app is backgrounded:

```typescript
useEffect(() => {
  const subscription = AppState.addEventListener("change", (nextAppState) => {
    if (nextAppState === "background") {
      llamaContext?.release(); // Free ~2 GB
    } else if (nextAppState === "active") {
      loadModel(); // Reload when foregrounded
    }
  });
  return () => subscription.remove();
}, []);
```

This prevents memory warnings and keeps your app responsive.
## Battery Impact
Running inference on-device uses power. Real-world testing:
| Usage Pattern | Extra Battery Drain |
|---|---|
| Light (5-10 queries/day) | <1% per day |
| Medium (20-30 queries/day) | ~3-4% per day |
| Heavy (50+ queries/day) | ~6-8% per day |
GPU acceleration is faster but uses more power than CPU. The trade-off is worth it—users prefer instant responses.
## Real-World Results
After two months in production with ~200 beta users:
Most common use cases:
- “Create a job for [customer name]” - 68% of queries
- “Find jobs for [customer]” - 15%
- “Update job [ID] to [status]” - 9%
- General questions about the app - 8%
Success rate: 91% of queries complete successfully on first try
Failure modes:
- Customer not found (user misspelled name) - 4%
- Context overflow (very long conversations) - 3%
- Hallucinated data (model made up customer IDs) - 2%
The hallucination issue improved dramatically after adding structured tool results with explicit `customer_id=42` fields. Small models need this scaffolding.
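The structured results look something like this (an illustrative shape—`formatSearchResult` and the `Customer` type are hypothetical names, not the app's actual code):

```typescript
type Customer = { id: number; name: string };

// Spell out IDs explicitly so the model copies them instead of inventing them.
function formatSearchResult(customers: Customer[]): string {
  if (customers.length === 0) return "[TOOL_RESULT] No customers found";
  const rows = customers.map((c) => `${c.name} (customer_id=${c.id})`);
  return `[TOOL_RESULT] Found: ${rows.join(", ")}`;
}
```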
## Cost Comparison
On-device (current):
- Infrastructure: $0/month
- Per-user cost: $0 (one-time 1.1 GB download)
- Scales to: unlimited users
Cloud alternative (estimated):
- Average conversation: ~2,000 tokens
- 200 users × 30 conversations/month = 6,000 conversations
- 6,000 × 2,000 = 12M tokens/month
- OpenAI GPT-3.5: $18/month
- OpenAI GPT-4: $360/month
- Anthropic Claude: $180/month
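The token math behind those numbers (the per-model prices above are the post's estimates at the time of writing):

```typescript
const users = 200;
const conversationsPerUser = 30;
const tokensPerConversation = 2000;

const conversations = users * conversationsPerUser;          // 6,000/month
const monthlyTokens = conversations * tokensPerConversation; // 12,000,000/month
console.log(conversations, monthlyTokens / 1e6); // 6000 12
```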
The on-device approach paid for itself immediately. Even at scale (10,000 users), the cost stays at $0.
## Limitations to Know
**1. Model capabilities.** Qwen 1.7B is great at structured tasks, terrible at:
- Complex reasoning (multi-hop questions)
- Factual knowledge (it’s not a search engine)
- Creative content generation
- Nuanced language understanding
Design your features around what small models do well.
**2. Device requirements.**

- Minimum: 3 GB RAM, 2 GB free storage
- Recommended: 6 GB RAM, 64-bit processor, GPU support

Older devices (iPhone 8, Galaxy S9) struggle. Consider feature detection and graceful fallback.
**3. Model updates.** To update the model, users need to re-download 1.1 GB. I versioned the model in the storage path:

```typescript
const MODEL_PATH = `${FileSystem.documentDirectory}llama-models/v2/Qwen3-1.7B-Q4_K_M.gguf`;
```

When shipping a new model version, increment the path. Old models auto-delete on uninstall.
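Old versions can also be cleaned up on upgrade rather than waiting for uninstall (a sketch—`staleModelDirs` is a hypothetical helper, and the actual deletion would go through expo-file-system):

```typescript
// Given the entries of the llama-models directory, pick old version folders
// to delete once the new model has downloaded successfully.
function staleModelDirs(entries: string[], currentVersion = "v2"): string[] {
  return entries.filter((name) => name !== currentVersion);
}
```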
## Platform Differences: iOS vs Android

**iOS (Metal):**

- Faster inference (~20% faster than Android)
- Better memory management
- Model backed up to iCloud by default (can disable via `NSURLIsExcludedFromBackupKey`)

**Android (Vulkan/OpenCL):**

- More variability across devices
- Some older GPUs don't support Vulkan (fallback to CPU)
- Storage not backed up automatically
Test on both platforms—performance can vary significantly.
## Debugging Tips

**1. Enable verbose logging**

```typescript
const ctx = await initLlama({
  model: MODEL_PATH,
  n_ctx: 8192,
  n_gpu_layers: 99,
  verbose: true, // Logs every token + timings
});
```

**2. Track token counts**

```typescript
const tokens = fullResponse.split(/\s+/).length * 1.3; // Rough estimate
console.log(`Generated ${tokens} tokens in ${duration}ms`);
```

**3. Monitor memory**

```typescript
import DeviceInfo from "react-native-device-info";

const memoryUsage = await DeviceInfo.getUsedMemory();
console.log(`Memory: ${(memoryUsage / 1024 / 1024).toFixed(0)} MB`);
```

**4. Test offline**
Enable airplane mode and verify everything works. The model should load from cache, inference should complete, and only backend persistence should fail gracefully.
## What's Next

**Multimodal support.** llama.cpp supports vision models (LLaVA, Qwen2-VL). Imagine: "Here's a photo of the curtain fabric—add it to the job."

**Fine-tuning.** Train a LoRA adapter on domain-specific data (past job descriptions, customer interactions). Could improve accuracy by 20-30%.

**Voice interface.** Combine with Whisper.cpp for on-device speech recognition. Fully offline voice assistant.

**Federated learning.** Aggregate model improvements across users without sharing raw data. Privacy-preserving personalization.
## Should You Build This?
Good fit:
- Business apps with sensitive data
- Offline-first workflows
- Structured tasks (classification, data entry, search)
- Cost-sensitive at scale
Not a good fit:
- Open-ended chat (use Claude/GPT-4)
- Knowledge-intensive tasks (small models have limited training data)
- Low-end device support required
- Rapidly changing domain knowledge
## Wrapping Up
On-device AI gives you instant responses, offline functionality, and zero marginal cost. The privacy angle is a bonus.
The tech stack:
- llama.rn: React Native bindings for llama.cpp
- Qwen 1.7B Q4_K_M: Compact, instruction-following model
- Custom tool system: XML-based function calling
- Stream-first architecture: Token-by-token UI updates
Total implementation: ~12 hours of development, $0/month infrastructure cost, works offline, complete privacy.
If you’re building a business app with sensitive data, this is a practical option now. The tooling works, the performance is there, and the cost is hard to beat.
Questions? Feedback? I’m @jaredlynskey. The llama.rn library is at github.com/mybigday/llama.rn.

