Building a Real On-Device Intent Classifier with a Small Language Model

  • Writer: Arda Doğantemur
  • Feb 9
  • 6 min read

Most AI demos look impressive—until you ask a simple question:

“Would this actually work inside a real mobile app, on a real device, for a real user?”

I’ll walk through how I built a fully on-device intent classifier using a Small Language Model (SLM), why smaller models behave very differently from large ones, and what actually breaks when you try to ship this in production.


In most retail or marketplace apps, intent is inferred after the user does something: 

  • They browse → we react 

  • They abandon checkout → we send an email 

  • They search → we show results 


But the most valuable moment is much earlier. The search box. 

Traditional approaches solve this with: 

  • Keyword rules 

  • Regex 

  • Hard-coded taxonomies 


These approaches do a really good job, don't get me wrong. But the search box can be much more powerful than this.


What is the Idea?


Recently, a TV I bought from Amazon had a shipping problem. Amazon’s support is great, but finding it isn’t.


To get help, I had to:

- Open my orders

- Find the purchase

- Navigate multiple screens

- Finally reach support after several clicks


As a user, I already knew what I wanted: help. But the app treated me like a shopper.

I wanted to solve that problem. I wanted something that: 

  • Understands meaning, not keywords 

  • Runs fully on the device 

  • Responds in under a second 

  • Is deterministic enough for UX flows 


The idea is simple:

Upgrade the search box with a local Small Language Model.

Instead of treating every search as a shopping query, the search box should understand intent.


Why On-Device, Not Cloud?


This wasn’t a philosophical choice. It was a product constraint.

I wanted: 

  • Zero network dependency 

  • No user data leaving the device 

  • Predictable latency 

  • No per-request cost 

  • Something that could run at scale 


This immediately rules out cloud LLMs. Instead, I focused on Small Language Models that can run locally using the DataSapien SDK.


Model Comparison: What Actually Worked (and What Didn’t) 

| Model | Size | Latency (after load) | Observed Behavior | Verdict |
| --- | --- | --- | --- | --- |
| Gemma-3-270m-it-q8_0 | ~280 MB | ~0.1–0.2 s | Struggled to follow rules consistently. Often ignored decision boundaries and failed to extract meaning from short queries. Outputs felt noisy and semantically shallow. | ❌ Too weak |
| Qwen 2.5 – 0.5B | ~500 MB | ~0.1–0.2 s | Overfit to dominant patterns. Tended to collapse into a single label (e.g. always SALES or always SUPPORT). Struggled with semantic ambiguity. | ❌ Unstable |
| Gemma 3n E4B | ~4.5 GB | ~3.0–4.9 s | Good semantic understanding and relatively stable decisions. However, model size and on-device cost made it impractical for this use case. | ⚠️ Good but expensive |
| Qwen 2.5 – 3B | ~1.6–2 GB | ~0.2–0.7 s | Consistent binary decisions. Strong semantic separation. Followed minimal prompts well. Stable across repeated runs. | ✅ Selected |
*Raw inference result data can be shared upon request. 


Model Choice: Qwen 2.5 (3B, Q4_K_M) 


After testing multiple models, I settled on: 

const model = "qwen2.5-3b-instruct-q4_k_m"; 
  • ~3B parameters: small enough for mobile, large enough for semantics 

  • Q4_K_M quantization: good balance of size vs reasoning 

  • Stable instruction following 

  • Fast inference once loaded 


~0.2s – 0.7s per query 

That’s fast enough to redirect UI immediately after search. 
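
If you want to sanity-check latency numbers like these yourself, a plain wall-clock wrapper around the invocation is enough. This is a sketch — `timeInvocation` is my own helper name, not part of any SDK:

```javascript
// Sketch: wall-clock timing around a single classification call.
// "invoke" stands in for a call like IntelligenceService.invokeModel(...).
async function timeInvocation(invoke) {
  const start = Date.now();
  const result = await invoke();        // run the actual inference
  const elapsedMs = Date.now() - start; // end-to-end latency in milliseconds
  return { result: result, elapsedMs: elapsedMs };
}
```

Measure after the model is loaded, otherwise the first call includes load time and skews the numbers.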


What the POC Looks Like


How the Journey Looks



  • Model & Inference Parameters Box

  • If model is downloaded or not Box

  • Download model Box

  • Show Search Box

  • Prompt Box

  • Local LLM Inference Box

  • Show Sales or Support Box
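
Conceptually, these boxes chain into one pipeline. Here is a rough sketch with every helper stubbed out — the stubs merely stand in for the boxes above, while the real boxes use JourneyContext and the DataSapien SDK:

```javascript
// Sketch of the journey as one async pipeline. Every helper is a stub
// standing in for a box above, not a real SDK call.
async function isModelDownloaded() { return true; }        // "downloaded or not" box
async function downloadModel() { /* download-model box */ }
function buildPrompt(query) { return "Query:\n" + query; } // prompt box
async function classify(prompt) { return "SALES_NUDGE"; }  // local LLM inference box

async function runJourney(query) {
  if (!(await isModelDownloaded())) {
    await downloadModel();
  }
  const label = await classify(buildPrompt(query));
  // Route: SUPPORT_PRIORITY skips the shopping flow entirely.
  return label === "SUPPORT_PRIORITY" ? "support_box" : "sales_box";
}
```
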


Model & Inference Params Box


In an on-device setup, how you run the model matters almost as much as which model you pick. 

I separated configuration into two layers: 

  • Model params: affect runtime performance and memory (context size, batching, threading, GPU offload). 

  • Inference params: affect output behavior (randomness, determinism, response length). 

Here’s the exact config I ended up using after many tests: 


Model Params (performance & memory) 

JourneyContext.putValue("nCtx", 5000);
JourneyContext.putValue("batch_size", 128);
JourneyContext.putValue("nThreads", 4);
JourneyContext.putValue("nGpuLayers", 20);

nCtx = 5000 

I wanted enough context headroom to keep the prompt stable even as the UI flow grows (extra system text, future routing rules, optional metadata). For intent classification, the user query is tiny — but the guardrails aren’t. A larger context window prevents accidental truncation and makes the system resilient as the prompt evolves. 


batch_size = 128 

Batch size mainly affects throughput and how efficiently the runtime processes tokens. I picked 128 as a safe middle ground: large enough to keep inference snappy, but not so large that it spikes memory or causes instability on weaker devices. 

 

nThreads = 4 

On mobile, more threads isn’t always better — it can increase contention, heat, and battery drain. Four threads gave me consistent latency without pushing the device into thermal throttling territory. 

 

nGpuLayers = 20 

This was the “speed lever.” Offloading some layers to the GPU reduced latency dramatically after the model was loaded. I didn’t max this out because GPU offload also increases VRAM pressure and can cause unpredictable slowdowns depending on the device. 20 layers was a stable sweet spot: fast enough, but still reliable. 


Inference Params (behavior & determinism) 


JourneyContext.putValue("temperature", 0.0);
JourneyContext.putValue("topP", 1.0);
JourneyContext.putValue("topK", 1);
JourneyContext.putValue("max_tokens", 512);

temperature = 0.0 

For this feature, “creative” is a bug. I’m routing the user through the UI based on a single label — I want deterministic behavior, not variation. 


topK = 1 (hard deterministic) 

This forces the model to always pick the single most likely next token. Combined with temperature 0, it eliminates sampling randomness almost entirely. This was crucial because I observed “label drift” in smaller models across repeated runs — topK=1 helped lock the behavior down. 

 

topP = 1.0 

With topK already restricting choices, topP becomes less relevant, but leaving it at 1.0 avoids accidentally filtering probability mass twice. Think of it as “don’t add extra constraints unless needed.” 

 

max_tokens = 512 

This is intentionally generous. The model is supposed to output a single label, so the actual output is tiny — but allowing more tokens prevents edge cases where the runtime truncates mid-output due to internal formatting quirks. It’s a safety buffer, not a target. 

 

Model is downloaded or not Box


(async function() {
  try {
    const modelId = JourneyContext.getValue("chosenModel");
    if (!modelId) {
      JourneyContext.putValue("isModelDownloaded", "false");
      onSuccess(true);
      return;
    }

    const result = await IntelligenceService.isModelFilesDownloaded(modelId);
    JourneyContext.putValue("isModelDownloaded", result ? "true" : "false");
    onSuccess(true);
    
  } catch (error) {
    console.error("Error checking model download status:", error);
    JourneyContext.putValue("isModelDownloaded", "false");
    onSuccess(true);
  }
})();

Show Search Box



Just one search box and one button.


Prompt Box


Here’s the exact prompt setup I used:

const userData = JourneyContext.getValue("textfield_gq51pv");

const system_prompt = `
You are a strict binary intent classifier.
Reply with exactly one label and nothing else.
`;

const usiPrompt = `
Classify the intent of the search query.

Definitions:
- SUPPORT_PRIORITY: help for a problem with something already owned or an order issue.
- SALES_NUDGE: shopping / discovery / comparing.

Decision:
If the query describes a problem or something not working → SUPPORT_PRIORITY.
Else → SALES_NUDGE.

Reply with exactly one label: SUPPORT_PRIORITY or SALES_NUDGE

Query:
${userData}
`;

JourneyContext.putValue("nbe_prompt", usiPrompt);
JourneyContext.putValue("system_prompt", system_prompt);

onSuccess(true);

The goal here isn’t to “hard-code” behavior or turn the model into a deterministic keyword matcher. This prompt is not a rule engine in disguise. It’s a minimal contract: define the labels, define the boundary, and let the model use semantics.


In fact, it’s the opposite:


  • I’m not enumerating keywords like “broken / return / refund”.

  • I’m not stuffing the prompt with dozens of examples.

  • I’m only giving the model a clean intent boundary and a single output contract.


So the model stays free to interpret meaning (“I’m unhappy with what I bought” vs “I’m exploring options”), while still returning something that a UX flow can trust.


Where determinism comes from

Instead of over-instructing the model, I keep the prompt lightweight and let inference parameters do most of the heavy lifting:


  • very low temperature

  • tight sampling (topK/topP)

  • a strict single-label output contract


That combination is what makes the output behave like a reliable router, not a chatty assistant.


Production guardrail: grammar

In production, you can go one step further:

Even with strict prompts, smaller models sometimes “leak” extra text (or formatting).

So the final safety net is adding a grammar constraint (GBNF / regex-style output guardrail) to force outputs into exactly:

  • SUPPORT_PRIORITY

  • SALES_NUDGE


I didn’t need that for early testing, but it’s an easy and very practical hardening step.
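
If grammar constraints aren’t available in your runtime, a cheap JS-side fallback is to normalize the raw output before routing. This is a minimal sketch — the function name and the choice of default label are mine, not part of the SDK:

```javascript
// Minimal output guardrail, applied after inference and before routing.
// Assumption: defaulting to SALES_NUDGE (the normal shopping flow) is the
// safe behavior when the output is unusable.
function normalizeIntent(rawOutput) {
  const text = String(rawOutput || "").toUpperCase();
  // Accept the label even if the model leaked surrounding text or formatting.
  if (text.includes("SUPPORT_PRIORITY")) return "SUPPORT_PRIORITY";
  if (text.includes("SALES_NUDGE")) return "SALES_NUDGE";
  // Unusable output: fall back to the default flow.
  return "SALES_NUDGE";
}
```

A grammar constraint prevents bad outputs at generation time; this normalizer just catches whatever slips through.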


LLM Invocation


(async function runOnce() { 
  try { 
    const chosenModel = JourneyContext.getValue("chosenModel"); 
    const systemPrompt = JourneyContext.getValue("system_prompt"); 
    const userPrompt = JourneyContext.getValue("nbe_prompt"); 

    // Model params 
    const modelParams = { 
      nCtx: JourneyContext.getValue("nCtx"),           // 5000 
      nBatchSize: JourneyContext.getValue("batch_size"), // 128 
      nThreads: JourneyContext.getValue("nThreads"),   // 4 
      nGpuLayers: JourneyContext.getValue("nGpuLayers") // 20 
    }; 

    // Inference params 
    const inferenceParams = { 
      temperature: JourneyContext.getValue("temperature"), // 0.0 
      topP: JourneyContext.getValue("topP"),               // 1.0 
      topK: JourneyContext.getValue("topK"),               // 1 
      maxTokens: JourneyContext.getValue("max_tokens")     // 512 
    }; 

    const prompts = [ 
      { role: "system", content: systemPrompt }, 
      { role: "user", content: userPrompt } 
    ]; 

    await IntelligenceService.loadModel(chosenModel, modelParams); // cached after the first load 

    const result = await IntelligenceService.invokeModel(
      chosenModel,
      prompts,
      inferenceParams
    ); 

    const output = String(result || "").trim(); 

    JourneyContext.putValue("llm_result", output); 
    onSuccess(true); 

  } catch (e) { 
    tlog("ERROR: " + e); 
    onSuccess(false); 
  } 
})();

The model is loaded once and cached on-device. After that, every invocation runs locally and returns in ~200–700 ms on a modern phone.


This is why I was comfortable choosing a slightly larger model (qwen2.5-3b-instruct-q4_k_m): the download happens once, but inference speed stays fast.


Show Sales or Support Box


Once the intent is classified, the UI reacts instantly. If the result is SALES_NUDGE, we show the shopping flow; if it’s SUPPORT_PRIORITY, we skip everything and take the user straight to the support box.


What's Next?


With fine-tuning, smaller models can become more stable without sacrificing speed or privacy. This opens the door to UX decisions that run entirely on the device—no cloud calls, no tracking, no latency.

