Building a Real On-Device Intent Classifier with a Small Language Model

  • Writer: Arda Doğantemur
  • Feb 9
  • 6 min read

Most AI demos look impressive—until you ask a simple question:

“Would this actually work inside a real mobile app, on a real device, for a real user?”

I’ll walk through how I built a fully on-device intent classifier using a Small Language Model (SLM), why smaller models behave very differently from large ones, and what actually breaks when you try to ship this in production.


In most retail or marketplace apps, intent is inferred after the user does something: 

  • They browse → we react 

  • They abandon checkout → we send an email 

  • They search → we show results 


But the most valuable moment is much earlier. The search box. 

Traditional approaches solve this with: 

  • Keyword rules 

  • Regex 

  • Hard-coded taxonomies 


These approaches do a really good job, don't get me wrong. But the search box can be much more powerful than this.


What is the Idea?


Recently, a TV I bought from Amazon had a shipping problem. Amazon’s support is great, but finding it isn’t.


To get help, I had to:

- Open my orders

- Find the purchase

- Navigate multiple screens

- Finally reach support after several clicks


As a user, I already knew what I wanted: help. But the app treated me like a shopper.

I wanted to solve that problem. I wanted something that: 

  • Understands meaning, not keywords 

  • Runs fully on the device 

  • Responds in under a second 

  • Is deterministic enough for UX flows 


The idea is simple:

Upgrade the search box with a local Small Language Model.

Instead of treating every search as a shopping query, the search box should understand intent.


Why On-Device, Not Cloud?


This wasn’t a philosophical choice. It was a product constraint.

I wanted: 

  • Zero network dependency 

  • No user data leaving the device 

  • Predictable latency 

  • No per-request cost 

  • Something that could run at scale 


This immediately rules out cloud LLMs. Instead, I focused on Small Language Models that can run locally using the DataSapien SDK.


Model Comparison: What Actually Worked (and What Didn’t) 

| Model | Size | Latency (after load) | Observed Behavior | Verdict |
| --- | --- | --- | --- | --- |
| Gemma-3-270m-it-q8_0 | ~280 MB | ~0.1–0.2 s | Struggled to follow rules consistently. Often ignored decision boundaries and failed to extract meaning from short queries. Outputs felt noisy and semantically shallow. | ❌ Too weak |
| Qwen 2.5 – 0.5B | ~500 MB | ~0.1–0.2 s | Overfit to dominant patterns. Tended to collapse into a single label (e.g. always SALES or always SUPPORT). Struggled with semantic ambiguity. | ❌ Unstable |
| Gemma 3n E4B | ~4.5 GB | ~3.0–4.9 s | Good semantic understanding and relatively stable decisions. However, model size and on-device cost made it impractical for this use case. | ⚠️ Good but expensive |
| Qwen 2.5 – 3B | ~1.6–2 GB | ~0.2–0.7 s | Consistent binary decisions. Strong semantic separation. Followed minimal prompts well. Stable across repeated runs. | ✅ Selected |
*Raw inference result data can be shared upon request. 


Model Choice: Qwen 2.5 (3B, Q4_K_M) 


After testing multiple models, I settled on: 

const model = "qwen2.5-3b-instruct-q4_k_m"; 
  • ~3B parameters: small enough for mobile, large enough for semantics 

  • Q4_K_M quantization: good balance of size vs reasoning 

  • Stable instruction following 

  • Fast inference once loaded 


~0.2s – 0.7s per query 

That’s fast enough to redirect UI immediately after search. 
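
If you want to sanity-check latency numbers like these yourself, a plain wall-clock wrapper around the invocation is enough. This is a sketch — `timeInvocation` is my own helper name, not part of any SDK:

```javascript
// Sketch: wall-clock timing around a single classification call.
// "invoke" stands in for a call like IntelligenceService.invokeModel(...).
async function timeInvocation(invoke) {
  const start = Date.now();
  const result = await invoke();        // run the actual inference
  const elapsedMs = Date.now() - start; // end-to-end latency in milliseconds
  return { result: result, elapsedMs: elapsedMs };
}
```

Measure after the model is loaded, otherwise the first call includes load time and skews the numbers.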


What the POC Looks Like


How the Journey Looks



  • Model & Inference Parameters Box

  • If model is downloaded or not Box

  • Download model Box

  • Show Search Box

  • Prompt Box

  • Local LLM Inference Box

  • Show Sales or Support Box
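
Conceptually, these boxes chain into one pipeline. Here is a rough sketch with every helper stubbed out — the stubs merely stand in for the boxes above, while the real boxes use JourneyContext and the DataSapien SDK:

```javascript
// Sketch of the journey as one async pipeline. Every helper is a stub
// standing in for a box above, not a real SDK call.
async function isModelDownloaded() { return true; }        // "downloaded or not" box
async function downloadModel() { /* download-model box */ }
function buildPrompt(query) { return "Query:\n" + query; } // prompt box
async function classify(prompt) { return "SALES_NUDGE"; }  // local LLM inference box

async function runJourney(query) {
  if (!(await isModelDownloaded())) {
    await downloadModel();
  }
  const label = await classify(buildPrompt(query));
  // Route: SUPPORT_PRIORITY skips the shopping flow entirely.
  return label === "SUPPORT_PRIORITY" ? "support_box" : "sales_box";
}
```
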


Model & Inference Params Box


In an on-device setup, how you run the model matters almost as much as which model you pick. 

I separated configuration into two layers: 

  • Model params: affect runtime performance and memory (context size, batching, threading, GPU offload). 

  • Inference params: affect output behavior (randomness, determinism, response length). 

Here’s the exact config I ended up using after many tests: 


Model Params (performance & memory) 

JourneyContext.putValue("nCtx", 5000);
JourneyContext.putValue("batch_size", 128);
JourneyContext.putValue("nThreads", 4);
JourneyContext.putValue("nGpuLayers", 20);

nCtx = 5000 

I wanted enough context headroom to keep the prompt stable even as the UI flow grows (extra system text, future routing rules, optional metadata). For intent classification, the user query is tiny — but the guardrails aren’t. A larger context window prevents accidental truncation and makes the system resilient as the prompt evolves. 


batch_size = 128 

Batch size mainly affects throughput and how efficiently the runtime processes tokens. I picked 128 as a safe middle ground: large enough to keep inference snappy, but not so large that it spikes memory or causes instability on weaker devices. 

 

nThreads = 4 

On mobile, more threads isn’t always better — it can increase contention, heat, and battery drain. Four threads gave me consistent latency without pushing the device into thermal throttling territory. 

 

nGpuLayers = 20 

This was the “speed lever.” Offloading some layers to the GPU reduced latency dramatically after the model was loaded. I didn’t max this out because GPU offload also increases VRAM pressure and can cause unpredictable slowdowns depending on the device. 20 layers was a stable sweet spot: fast enough, but still reliable. 


Inference Params (behavior & determinism) 


JourneyContext.putValue("temperature", 0.0);
JourneyContext.putValue("topP", 1.0);
JourneyContext.putValue("topK", 1);
JourneyContext.putValue("max_tokens", 512);

temperature = 0.0 

For this feature, “creative” is a bug. I’m routing the user through the UI based on a single label — I want deterministic behavior, not variation. 


topK = 1 (hard deterministic) 

This forces the model to always pick the single most likely next token. Combined with temperature 0, it eliminates sampling randomness almost entirely. This was crucial because I observed “label drift” in smaller models across repeated runs — topK=1 helped lock the behavior down. 

 

topP = 1.0 

With topK already restricting choices, topP becomes less relevant, but leaving it at 1.0 avoids accidentally filtering probability mass twice. Think of it as “don’t add extra constraints unless needed.” 

 

max_tokens = 512 

This is intentionally generous. The model is supposed to output a single label, so the actual output is tiny — but allowing more tokens prevents edge cases where the runtime truncates mid-output due to internal formatting quirks. It’s a safety buffer, not a target. 

 

Model is downloaded or not Box


(async function() {
  try {
    const modelId = JourneyContext.getValue("chosenModel");
    if (!modelId) {
      JourneyContext.putValue("isModelDownloaded", "false");
      onSuccess(true);
      return;
    }

    const result = await IntelligenceService.isModelFilesDownloaded(modelId);
    JourneyContext.putValue("isModelDownloaded", result ? "true" : "false");
    onSuccess(true);
    
  } catch (error) {
    console.error("Error checking model download status:", error);
    JourneyContext.putValue("isModelDownloaded", "false");
    onSuccess(true);
  }
})();

Show Search Box



Just one search box and one button.


Prompt Box


Here’s the exact prompt setup I used:

const userData = JourneyContext.getValue("textfield_gq51pv");

const system_prompt = `
You are a strict binary intent classifier.
Reply with exactly one label and nothing else.
`;

const usiPrompt = `
Classify the intent of the search query.

Definitions:
- SUPPORT_PRIORITY: help for a problem with something already owned or an order issue.
- SALES_NUDGE: shopping / discovery / comparing.

Decision:
If the query describes a problem or something not working → SUPPORT_PRIORITY.
Else → SALES_NUDGE.

Reply with exactly one label: SUPPORT_PRIORITY or SALES_NUDGE

Query:
${userData}
`;

JourneyContext.putValue("nbe_prompt", usiPrompt);
JourneyContext.putValue("system_prompt", system_prompt);

onSuccess(true);

The goal here isn’t to “hard-code” behavior or turn the model into a deterministic keyword matcher. This prompt is not a rule engine in disguise. It’s a minimal contract: define the labels, define the boundary, and let the model use semantics.


In fact, it’s the opposite:


  • I’m not enumerating keywords like “broken / return / refund”.

  • I’m not stuffing the prompt with dozens of examples.

  • I’m only giving the model a clean intent boundary and a single output contract.


So the model stays free to interpret meaning (“I’m unhappy with what I bought” vs “I’m exploring options”), while still returning something that a UX flow can trust.


Where determinism comes from

Instead of over-instructing the model, I keep the prompt lightweight and let inference parameters do most of the heavy lifting:


  • very low temperature

  • tight sampling (topK/topP)

  • a strict single-label output contract


That combination is what makes the output behave like a reliable router, not a chatty assistant.


Production guardrail: grammar

In production, you can go one step further:

Even with strict prompts, smaller models sometimes “leak” extra text (or formatting).

So the final safety net is adding a grammar constraint (GBNF / regex-style output guardrail) to force outputs into exactly:

  • SUPPORT_PRIORITY

  • SALES_NUDGE


I didn’t need that for early testing, but it’s an easy and very practical hardening step.
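
If grammar constraints aren’t available in your runtime, a cheap JS-side fallback is to normalize the raw output before routing. This is a minimal sketch — the function name and the choice of default label are mine, not part of the SDK:

```javascript
// Minimal output guardrail, applied after inference and before routing.
// Assumption: defaulting to SALES_NUDGE (the normal shopping flow) is the
// safe behavior when the output is unusable.
function normalizeIntent(rawOutput) {
  const text = String(rawOutput || "").toUpperCase();
  // Accept the label even if the model leaked surrounding text or formatting.
  if (text.includes("SUPPORT_PRIORITY")) return "SUPPORT_PRIORITY";
  if (text.includes("SALES_NUDGE")) return "SALES_NUDGE";
  // Unusable output: fall back to the default flow.
  return "SALES_NUDGE";
}
```

A grammar constraint prevents bad outputs at generation time; this normalizer just catches whatever slips through.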


LLM Invocation


(async function runOnce() { 
  try { 
    const chosenModel = JourneyContext.getValue("chosenModel"); 
    const systemPrompt = JourneyContext.getValue("system_prompt"); 
    const userPrompt = JourneyContext.getValue("nbe_prompt"); 

    // Model params 
    const modelParams = { 
      nCtx: JourneyContext.getValue("nCtx"),           // 5000 
      nBatchSize: JourneyContext.getValue("batch_size"), // 128 
      nThreads: JourneyContext.getValue("nThreads"),   // 4 
      nGpuLayers: JourneyContext.getValue("nGpuLayers") // 20 
    }; 

    // Inference params 
    const inferenceParams = { 
      temperature: JourneyContext.getValue("temperature"), // 0.0 
      topP: JourneyContext.getValue("topP"),               // 1.0 
      topK: JourneyContext.getValue("topK"),               // 1 
      maxTokens: JourneyContext.getValue("max_tokens")     // 512 
    }; 

    const prompts = [ 
      { role: "system", content: systemPrompt }, 
      { role: "user", content: userPrompt } 
    ]; 

    await IntelligenceService.loadModel(chosenModel, modelParams); // cached after the first load 

    const result = await IntelligenceService.invokeModel(
      chosenModel,
      prompts,
      inferenceParams
    ); 

    const output = String(result || "").trim(); 

    JourneyContext.putValue("llm_result", output); 
    onSuccess(true); 

  } catch (e) { 
    tlog("ERROR: " + e); 
    onSuccess(false); 
  } 
})();

The model is loaded once and cached on-device. After that, every invocation runs locally and returns in ~200–700 ms on a modern phone.


This is why I was comfortable choosing a slightly larger model (qwen2.5-3b-instruct-q4_k_m): the download happens once, but inference speed stays fast.


Show Sales or Support Box


Once the intent is classified, the UI reacts instantly. If the result is SALES_NUDGE, we show the shopping flow; if it’s SUPPORT_PRIORITY, we skip everything and take the user straight to the support box.


What's Next?


With fine-tuning, smaller models can become more stable without sacrificing speed or privacy. This opens the door to UX decisions that run entirely on the device—no cloud calls, no tracking, no latency.

