Despia Local Intelligence is in beta. The API spec will likely change before the official launch. A dedicated NPM package for Local Intelligence will be released before launch to make setup more convenient - the current despia-native integration is temporary.
Despia Local Intelligence requires Despia V4, which is currently in beta. To request access, email beta@despia.com.
Despia Local Intelligence brings on-device machine learning to Despia apps. Models are downloaded once from HuggingFace and cached locally. All inference runs on-device - no cloud calls, no round-trip latency, no data leaving the device.
HuggingFace model inference runs on both iOS and Android. The appleintelligence:// one-shot scheme is iOS only. The device must have sufficient free storage for the selected model before inference can run.

What Despia Local Intelligence can do

Depending on which model you load, the on-device runtime handles six distinct capability categories.

Text generation

Load a text model, pass a prompt, and receive a response as a full string or streamed token by token. System instructions are supported.

Transcription

Convert audio to text. Batch mode processes a complete file. Streaming mode processes live microphone input in real time with partial results as they arrive.

Embeddings

Produce float vector representations of text, images, or audio for semantic search, similarity ranking, and clustering.
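Similarity ranking over the returned vectors is plain arithmetic on your side and involves no Despia API. A minimal cosine-similarity helper:

```javascript
// Cosine similarity between two embedding vectors of equal length.
// Returns 1.0 for identical directions, 0.0 for orthogonal vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Rank candidate texts by computing their embeddings once, then sorting by similarity to the query embedding.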

Voice activity detection

Segment audio into speech and non-speech regions. Use it to filter microphone input before passing it to a transcription model.
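To illustrate the filtering pattern only: real VAD models such as silero-vad use learned features, but a naive RMS energy gate shows the same shape of computation - classify short frames, keep the speech-like ones.

```javascript
// Conceptual sketch of a VAD stage; NOT the Despia API or the silero-vad model.
// Classifies a frame of audio samples (floats in [-1, 1]) as speech-like
// using a crude RMS energy threshold.
function isSpeechFrame(samples, threshold = 0.02) {
  let sumSq = 0;
  for (const s of samples) sumSq += s * s;
  const rms = Math.sqrt(sumSq / samples.length);
  return rms > threshold;
}

// Keep only frames likely to contain speech before sending them onward.
function filterSpeech(frames, threshold = 0.02) {
  return frames.filter((frame) => isSpeechFrame(frame, threshold));
}
```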

RAG query

Retrieve relevant passages from a corpus attached to a text model. The runtime builds and queries an index over your documents.

Vector index

A standalone on-device vector store. Add documents and embeddings, query by similarity, retrieve by ID. No language model required.
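To make the add / query-by-similarity / retrieve-by-ID surface concrete, here is a naive in-memory sketch of the same idea. The class and method names are illustrative only - this is not the Despia API, and the real index persists on-device rather than in a Map.

```javascript
// Illustrative in-memory vector store; names and shapes are assumptions.
class NaiveVectorIndex {
  constructor() {
    this.docs = new Map(); // id -> { text, embedding }
  }
  add(id, text, embedding) {
    this.docs.set(id, { text, embedding });
  }
  get(id) {
    return this.docs.get(id);
  }
  // Return the topK documents ranked by cosine similarity to the query vector.
  query(embedding, topK = 3) {
    return [...this.docs]
      .map(([id, doc]) => ({ id, text: doc.text, score: cosine(embedding, doc.embedding) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}
```

The real component does the same three operations, but with embeddings produced by an on-device model and storage that survives app restarts.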

Why on-device

Cloud AI APIs work well for most cases. Despia Local Intelligence is for the cases where they do not.

Privacy

With a cloud API, the user’s input goes to a remote server. For apps handling sensitive input - medical notes, personal journals, legal documents - on-device inference means the input never leaves the device.

Offline operation

Cloud APIs require a network connection. Once a HuggingFace model is downloaded, inference runs without any connectivity. Travel apps, field tools, and low-signal environments all benefit.

Latency

A cloud API round-trip adds network overhead on every request. On-device inference starts processing immediately with no round-trip.

Cost

Cloud inference is billed per token. On-device inference has no per-call cost after the model is downloaded.
The tradeoff is real: on-device models are smaller than frontier cloud models. A 600M parameter model running on a phone is not GPT-4. Despia Local Intelligence is the right choice when privacy, offline support, latency, or cost matter more than raw capability.

How it works

Your web code fires a URL scheme call. The native WebView intercepts it, runs inference on-device, and delivers results back through global JavaScript callbacks. Your web code never calls the model directly - the native layer owns the full inference lifecycle.
appleintelligence:// runs a prompt to completion and calls a named function on window with the full response. The callback defaults to handleAIResponse but can be customized via the callback parameter. On failure, the same callback receives the error message.
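A minimal sketch of that flow. The "prompt" query parameter name is an assumption for illustration; consult the Reference for the exact scheme parameters. (Inside the WebView, globalThis is window.)

```javascript
// Register the global callback the native layer invokes with the full response.
// handleAIResponse is the default name; override it via the callback parameter.
// On failure, the same callback receives the error message.
globalThis.handleAIResponse = (response) => {
  console.log("Model response:", response);
};

// Build the one-shot scheme URL (iOS only). The "prompt" parameter name is
// an assumption; check the Reference for the actual scheme shape.
function promptUrl(prompt) {
  return "appleintelligence://?prompt=" + encodeURIComponent(prompt);
}

// In the WebView, trigger inference by navigating:
// window.location.href = promptUrl("Summarize this note");
```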

Model download and storage

Models are hosted on HuggingFace and downloaded to the device’s Application Support directory. Downloads use NSURLSession background transfer on iOS and WorkManager on Android - they continue even if the app is closed. A downloaded model persists across launches with no re-download. Each model is available in two quantizations:

int4

Smaller file size, faster inference. The right starting point for most use cases.

int8

Higher output quality, larger download. Use when response quality matters more than speed.

Available models

Model                   Notes
lfm2-8b-a1b             Largest text model
lfm2-2.6b
youtu-llm-2b
qwen3-1.7b
lfm2.5-1.2b-instruct    Strong instruction following
lfm2.5-1.2b-thinking    Chain-of-thought reasoning
gemma-3n-e4b-it
gemma-3-1b-it
qwen3-0.6b              Good starting point
lfm2-700m
lfm2.5-350m             Smallest text model

Choosing a model

Start with qwen3-0.6b int4. It is the smallest general-purpose text model and gives you a fast feedback loop while building. Move up if output quality is insufficient.
lfm2.5-1.2b-instruct is a reliable middle ground. It follows instructions well and runs in a few seconds on recent devices.
lfm2.5-1.2b-thinking is the right choice for tasks that benefit from chain-of-thought reasoning - math, logic, and multi-step problems.
For transcription, whisper-tiny and moonshine-base are fast and suitable for real-time use. parakeet-tdt-0.6b-v3 produces higher accuracy at the cost of a larger download.
For voice activity detection, silero-vad is lightweight and suited for live audio segmentation. segmentation-3.0 provides higher accuracy.

Does this replace Apple Intelligence?

No. Apple Intelligence and Despia Local Intelligence serve different purposes and can coexist in the same app.
                              Apple Intelligence    Despia Local Intelligence
Platform                      iOS only              iOS and Android
Requires iOS 26+              Yes                   No
Apple Intelligence enabled    Required              Not required
Text generation               Yes                   Yes
Transcription                 No                    Yes
Embeddings                    No                    Yes
VAD                           No                    Yes
Vector search                 No                    Yes

Frequently asked questions

How large are the model downloads?
Small models (350M-700M parameters, int4) are typically 200-400MB. Medium models (1-2B parameters, int4) are 600MB to 1.2GB. Prompt users to download on WiFi.

Do I need to manage downloads or caching myself?
No. The native layer manages downloads, caching, retry, and storage. You do not write download logic.

How do I handle errors?
window.onMLError receives an object with an errorCode and errorMessage. See the Reference for the full error code table.
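A minimal registration for that handler, using the errorCode and errorMessage fields described above; the logging is just an illustration. (Inside the WebView, globalThis is window.)

```javascript
// Global error handler invoked by the native layer on any inference failure.
// The { errorCode, errorMessage } shape is documented; codes are listed in
// the Reference.
globalThis.onMLError = ({ errorCode, errorMessage }) => {
  console.error(`Local Intelligence error ${errorCode}: ${errorMessage}`);
};
```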

Can I run multiple models at once?
Each inference session uses one model. Concurrent sessions across different models are possible but memory-intensive. Prefer sequential inference on devices with limited RAM.

Is downloading models at runtime allowed by the app stores?
Yes. Downloading model weights follows the same pattern as navigation apps downloading map tiles or audio apps downloading content. No native code is downloaded post-install.

Can I use the vector index without a language model?
Yes. The vector index is a standalone component and operates independently of any model.

Resources

NPM Package

despia-native