Despia Local Intelligence requires Despia V4, which is currently in beta. To request access, email beta@despia.com.
HuggingFace model inference runs on both iOS and Android. The
appleintelligence:// one-shot scheme is iOS only. The device must have sufficient free storage for the selected model before inference can run.
What Despia Local Intelligence can do
Depending on which model you load, the on-device runtime handles six distinct capability categories.
Text generation
Load a text model, pass a prompt, and receive a response as a full string or streamed token by token. System instructions are supported.
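Streamed responses arrive token by token through a global callback, so the usual pattern is to accumulate partials until the stream ends. A minimal sketch; the callback name `onMLToken` and its `(token, done)` signature are assumptions for illustration, not the documented API:

```javascript
// Hypothetical streaming consumer. Only the general mechanism (tokens
// delivered to a global JavaScript callback) comes from the docs; the
// onMLToken name and (token, done) shape are illustrative assumptions.
function makeStreamCollector(onComplete) {
  let buffer = "";
  return (token, done) => {
    buffer += token;              // accumulate partial output as it arrives
    if (done) onComplete(buffer); // hand over the full response at the end
  };
}

// Register where the native layer can reach it:
globalThis.onMLToken = makeStreamCollector((text) => {
  console.log("full response:", text);
});
```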
Transcription
Convert audio to text. Batch mode processes a complete file. Streaming mode processes live microphone input in real time with partial results as they arrive.
Embeddings
Produce float vector representations of text, images, or audio for semantic search, similarity ranking, and clustering.
Voice activity detection
Segment audio into speech and non-speech regions. Use it to filter microphone input before passing to a transcription model.
RAG query
Retrieve relevant passages from a corpus attached to a text model. The runtime builds and queries an index over your documents.
Vector index
A standalone on-device vector store. Add documents and embeddings, query by similarity, retrieve by ID. No language model required.
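"Query by similarity" means ranking stored vectors by their closeness to a query embedding, most commonly via cosine similarity. The index exposes this through its own API; the sketch below only illustrates the underlying math:

```javascript
// Cosine similarity between two equal-length float vectors: 1 means same
// direction (most similar), 0 means orthogonal (unrelated).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([1, 0, 0], [1, 0, 0]); // → 1
cosineSimilarity([1, 0, 0], [0, 1, 0]); // → 0
```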
Why on-device
Cloud AI APIs work well for most cases. Despia Local Intelligence is for the cases where they do not.
Privacy
With a cloud API, the user’s input goes to a remote server. For apps handling sensitive input - medical notes, personal journals, legal documents - on-device inference means the input never leaves the device.
Offline operation
Cloud APIs require a network connection. Once a HuggingFace model is downloaded, inference runs without any connectivity. Travel apps, field tools, and low-signal environments all benefit.
Latency
A cloud API round-trip adds network overhead on every request. On-device inference starts processing immediately with no round-trip.
Cost
Cloud inference is billed per token. On-device inference has no per-call cost after the model is downloaded.
The tradeoff is real: on-device models are smaller than frontier cloud models. A 600M parameter model running on a phone is not GPT-4. Despia Local Intelligence is the right choice when privacy, offline support, latency, or cost matter more than raw capability.
How it works
Your web code fires a URL scheme call. The native WebView intercepts it, runs inference on-device, and delivers results back through global JavaScript callbacks. Your web code never calls the model directly - the native layer owns the full inference lifecycle.
- One-shot (iOS only)
- Streaming (iOS + Android)
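The one-shot flow can be sketched as follows. The `prompt` query parameter name is an assumption for illustration; the `callback` parameter and the `handleAIResponse` default are documented:

```javascript
// Register the result handler globally, then build the scheme URL that the
// native layer intercepts. On failure, the same callback receives the
// error message. The `prompt` parameter name is an assumption.
function runOneShot(prompt, onResult, callbackName = "handleAIResponse") {
  globalThis[callbackName] = onResult;
  return (
    "appleintelligence://?prompt=" + encodeURIComponent(prompt) +
    "&callback=" + encodeURIComponent(callbackName)
  );
}

// In the app, navigation triggers the native interception:
// window.location.href = runOneShot("Summarize this note", (r) => console.log(r));
```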
appleintelligence:// runs a prompt to completion and calls a named function on window with the full response. The callback defaults to handleAIResponse but can be customized via the callback parameter. On failure, the same callback receives the error message.
Model download and storage
Models are hosted on HuggingFace and downloaded to the device’s Application Support directory. Downloads use NSURLSession background transfer on iOS and WorkManager on Android - they continue even if the app is closed. A downloaded model persists across launches with no re-download.
Each model is available in two quantizations:
int4
Smaller file size, faster inference. The right starting point for most use cases.
int8
Higher output quality, larger download. Use when response quality matters more than speed.
Available models
- Text
- Vision
- ASR
- Embedding / VAD / Speaker
| Model | Notes |
|---|---|
| lfm2-8b-a1b | Largest text model |
| lfm2-2.6b | |
| youtu-llm-2b | |
| qwen3-1.7b | |
| lfm2.5-1.2b-instruct | Strong instruction following |
| lfm2.5-1.2b-thinking | Chain-of-thought reasoning |
| gemma-3n-e4b-it | |
| gemma-3-1b-it | |
| qwen3-0.6b | Good starting point |
| lfm2-700m | |
| lfm2.5-350m | Smallest text model |
Choosing a model
General text tasks
Start with qwen3-0.6b int4. It is the smallest general-purpose text model and gives you a fast feedback loop while building. Move up if output quality is insufficient.
Instruction following
lfm2.5-1.2b-instruct is a reliable middle ground. It follows instructions well and runs in a few seconds on recent devices.
Reasoning and math
lfm2.5-1.2b-thinking is the right choice for tasks that benefit from chain-of-thought reasoning - math, logic, and multi-step problems.
Transcription
whisper-tiny and moonshine-base are fast and suitable for real-time use. parakeet-tdt-0.6b-v3 produces higher accuracy at the cost of a larger download.
Embeddings and semantic search
nomic-embed-text-v2-moe is a strong general-purpose choice. qwen3-embedding-0.6b is smaller and faster.
VAD
silero-vad is lightweight and suited for live audio segmentation. segmentation-3.0 provides higher accuracy.
Does this replace Apple Intelligence?
No. Apple Intelligence and Despia Local Intelligence serve different purposes and can coexist in the same app.
| | Apple Intelligence | Despia Local Intelligence |
|---|---|---|
| Platform | iOS only | iOS and Android |
| Requires iOS 26+ | Yes | No |
| Apple Intelligence enabled | Required | Not required |
| Text generation | Yes | Yes |
| Transcription | No | Yes |
| Embeddings | No | Yes |
| VAD | No | Yes |
| Vector search | No | Yes |
Frequently asked questions
How large are the models?
Small models (350M-700M parameters, int4) are typically 200-400MB. Medium models (1-2B parameters, int4) are 600MB to 1.2GB. Prompt users to download on WiFi.
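One way to decide when to prompt is a rough free-space check. A sketch using the standard `navigator.storage.estimate()` web API; note that it reports the WebView origin's storage quota, not the device's native free disk space, so treat the result only as a heuristic:

```javascript
// Heuristic pre-download check. The estimate source is injectable so the
// logic is testable outside a browser; by default it uses the standard
// StorageManager API (navigator.storage.estimate()).
async function hasRoomFor(modelBytes, estimate = () => navigator.storage.estimate()) {
  const { quota = 0, usage = 0 } = await estimate();
  return quota - usage > modelBytes;
}

// Usage sketch: gate the download prompt on ~400MB of headroom.
// hasRoomFor(400 * 1024 * 1024).then((ok) => { if (ok) promptDownload(); });
```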
Do I need to handle the download myself?
No. The native layer manages downloads, caching, retry, and storage. You do not write download logic.
What happens if inference fails?
window.onMLError receives an object with an errorCode and errorMessage. See the Reference for the full error code table.
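A minimal handler sketch, assuming only the documented object shape (`errorCode`, `errorMessage`); the fallback behavior in the comment is one possible choice, not part of the API:

```javascript
// Format the documented error object into a log line.
function formatMLError(err) {
  return "inference failed [" + err.errorCode + "]: " + err.errorMessage;
}

// Global hook invoked by the native layer when inference fails.
globalThis.onMLError = (err) => {
  console.error(formatMLError(err));
  // e.g. fall back to a cloud API or surface a retry prompt here
};
```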
Can I run multiple models at the same time?
Each inference session uses one model. Concurrent sessions across different models are possible but memory-intensive. Prefer sequential inference on devices with limited RAM.
Is this App Store compliant?
Yes. Downloading model weights follows the same pattern as navigation apps downloading map tiles or audio apps downloading content. No native code is downloaded post-install.
Can I use the vector index without a language model?
Yes. The vector index is a standalone component and operates independently of any model.
Resources
NPM Package
despia-native