Despia Local AI brings real on-device language model inference to your Despia app. Models load via the device’s native AI acceleration stack (Metal, Core ML, and the Apple Neural Engine on iOS; GPU, NNAPI, and the CPU fast-path on Android). Prompts never leave the device, inference is free at any scale, and every feature keeps working in airplane mode.
Installation
- Bundle
- CDN
Why we built this
Most apps that ship an AI feature pipe every prompt to somebody else’s server. You pay per token. You wait for a network round-trip. You run a backend proxy to hide your API key. You renegotiate your privacy policy.
That is the right tradeoff for a ChatGPT clone. It is the wrong tradeoff for the AI features people actually want to ship: summarising a note, classifying a tag, rewriting a paragraph, extracting JSON from a form, drafting a reply suggestion.
We wanted on-device AI that works the way web developers expect: import a package, call a function, get tokens back. No API keys to guard. No backend proxy. No bill-shock.
So we built despia-intelligence, a JavaScript bridge from your web code to the device’s native AI hardware. Models live in native memory. Inference runs on the Neural Engine, GPU, or NNAPI. Tokens stream back to your handler.
Cross-platform by design
Despia Local AI is built for both iOS and Android through a unified JavaScript SDK. Write your code once. It works identically on both platforms. No iOS-specific workarounds. No Android edge cases. The API is fully standardized so you can focus on building your app instead of handling platform differences.
Fully on-device. No cloud dependencies.
Despia Local AI runs entirely on the device. There is no cloud component, no proprietary inference backend, no API gateway sitting between you and your users.
Private by default. Prompts, user text, and generated tokens never leave the device. Nothing to log, retain, or train on.
Use any model. Open-weights text models from Qwen, Liquid LFM, Google Gemma, and Tencent Youtu are downloaded on demand from Hugging Face into the Despia container. New models ship over the air without an SDK upgrade.
No per-token fees. Unlike cloud LLM APIs that charge per token, Despia Local AI has zero runtime costs. Run as many completions as the device can handle. Scale to millions of users without inference costs scaling with them.
Works offline completely. Once a model is downloaded, the entire system operates without any network connectivity. No license checks, no heartbeats, no server dependencies.
How it works
Gate calls behind intelligence.runtime.ok so the same code works in a desktop browser preview. Make sure the model is installed, then fire intelligence.run() with your params and a handler.
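A minimal sketch of that gate, with the SDK object passed in as an argument so the helper also runs where the package is absent. Only runtime.ok and run(params, handler) come from the text above. The model and prompt parameter names and the summarise helper are assumptions for illustration, so check the API reference for the real parameter shape.

```javascript
// Sketch only: `model` and `prompt` are assumed parameter names,
// and `summarise` is a hypothetical helper, not part of the SDK.
function summarise(intelligence, text, handler) {
  // In a desktop browser preview the native runtime is absent,
  // so skip inference instead of throwing.
  if (!intelligence || !intelligence.runtime || !intelligence.runtime.ok) {
    return false;
  }
  intelligence.run(
    { model: 'qwen3-0.6b', prompt: `Summarise in one sentence:\n${text}` },
    handler
  );
  return true;
}
```

In an app you would pass the object exported by despia-intelligence and hide or disable the feature when summarise returns false.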
The stream callback receives the full accumulated text on every tick, not a delta. Replace the DOM content rather than appending. The complete callback receives the final response string when generation finishes.
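Because each tick carries the whole text so far, a handler can simply overwrite its output. The sketch below assumes the handler is an object with stream and complete functions (the exact handler shape is an assumption based on the callback names above) and that target is anything with a textContent property, such as a DOM element. createHandler is a hypothetical helper.

```javascript
// Sketch only: the { stream, complete } handler shape is assumed
// from the callback names in the text above.
function createHandler(target, onDone) {
  return {
    // Each tick carries the full text so far, so overwrite, never append.
    stream(text) {
      target.textContent = text;
    },
    // Fires once with the final response string.
    complete(finalText) {
      target.textContent = finalText;
      if (onDone) onDone(finalText);
    },
  };
}
```

Appending instead of overwriting would render duplicated prefixes, which is the usual symptom of treating the accumulated text as a delta.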
If the model is not yet on the device, download it first. Downloads continue in the background even when the app is closed, using NSURLSession on iOS and WorkManager on Android.
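The install-then-run order can be sketched as below. The text does not name the SDK calls for checking or downloading a model, so isInstalled and download are hypothetical adapter functions you would write around the SDK's models API. They are assumed to return promises, and ensureModel is likewise illustrative.

```javascript
// Sketch only: `isInstalled` and `download` are hypothetical adapters
// you would write around the SDK's models API; they are not SDK names.
async function ensureModel(adapters, modelId) {
  if (await adapters.isInstalled(modelId)) {
    return 'already-installed';
  }
  // Assumed to resolve once the download has finished.
  await adapters.download(modelId);
  return 'downloaded';
}
```

If the app is closed mid-download, this promise is lost with the page, but the native transfer keeps going, so a later call should find the model already installed.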
Background and return
Inference sessions do not survive the WebView being suspended by the OS. The SDK handles this for you. When the user hits home, opens another app, and comes back, every in-flight job re-fires automatically with the same params and the same handler. Any number of concurrent jobs all resume. Just write your code as if backgrounding does not exist. No visibilitychange, no pagehide, no manual state to serialise.
Available models
Pick by size first, quality second. Smaller models load faster, use less RAM, and work on older devices. Larger models give higher quality at the cost of latency and memory.

| Model | Size | Family | Strengths | Tier |
|---|---|---|---|---|
| lfm2.5-350m | 350M | Liquid LFM2.5 | Ultra-fast, tiny memory footprint | Any |
| qwen3-0.6b | 600M | Alibaba Qwen3 | Balanced default | Any |
| lfm2-700m | 700M | Liquid LFM2 | Low-latency general chat | Any |
| gemma-3-1b-it | 1B | Google Gemma 3 | Strong instruction tuning | Modern |
| lfm2.5-1.2b-instruct | 1.2B | Liquid LFM2.5 | Structured instructions | Modern |
| lfm2.5-1.2b-thinking | 1.2B | Liquid LFM2.5 | Chain-of-thought reasoning | Modern |
| qwen3-1.7b | 1.7B | Alibaba Qwen3 | Stronger reasoning | Modern |
| youtu-llm-2b | 2B | Tencent Youtu | Multilingual (CN/EN) | Modern |
| lfm2-2.6b | 2.6B | Liquid LFM2 | Higher-quality general chat | Modern to Flagship |
| gemma-3n-e4b-it | 4B effective | Google Gemma 3n | Mobile-optimised | Flagship |
| lfm2-8b-a1b | 8B MoE (1B active) | Liquid LFM2 | Highest quality on-device | Flagship |
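
One way to apply the size-first rule in code, as a sketch. The parameter counts are copied from the table. Everything else is hypothetical and not part of the SDK: maxParamsB is an app-chosen size budget in billions of parameters, installableIds stands for whatever list of model ids the runtime reports as installable, and the "largest model within budget" policy is just one reasonable choice.

```javascript
// Parameter counts (billions) copied from the table above. The MoE entry
// uses its total count; gemma-3n uses the table's "effective" figure.
const MODELS = [
  { id: 'lfm2.5-350m', paramsB: 0.35 },
  { id: 'qwen3-0.6b', paramsB: 0.6 },
  { id: 'lfm2-700m', paramsB: 0.7 },
  { id: 'gemma-3-1b-it', paramsB: 1 },
  { id: 'lfm2.5-1.2b-instruct', paramsB: 1.2 },
  { id: 'lfm2.5-1.2b-thinking', paramsB: 1.2 },
  { id: 'qwen3-1.7b', paramsB: 1.7 },
  { id: 'youtu-llm-2b', paramsB: 2 },
  { id: 'lfm2-2.6b', paramsB: 2.6 },
  { id: 'gemma-3n-e4b-it', paramsB: 4 },
  { id: 'lfm2-8b-a1b', paramsB: 8 },
];

// Hypothetical policy: the largest model within the budget that is
// also in the installable list. Returns null when nothing fits.
function pickModel(maxParamsB, installableIds) {
  const allowed = new Set(installableIds);
  const candidates = MODELS
    .filter((m) => m.paramsB <= maxParamsB && allowed.has(m.id))
    .sort((a, b) => b.paramsB - a.paramsB);
  return candidates.length > 0 ? candidates[0].id : null;
}
```

A budget of 1 with every model installable selects gemma-3-1b-it. Lowering the budget for older devices falls back to the smaller models in the "Any" tier.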
Models come in int4 (smaller, faster) and int8 (higher quality) quantizations. The runtime picks the quantization based on device capability. Small text models land around 200 to 400 MB (int4); medium text models 600 MB to 1.2 GB. Prompt users to download on Wi-Fi.
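The quoted download sizes follow roughly from the parameter count and the quantization width: weight bytes are about parameters times bits per weight, divided by 8. The helper below is a back-of-the-envelope sketch, not an SDK function, and it gives a lower bound, since real model files usually also carry metadata, a tokenizer, and some layers kept at higher precision.

```javascript
// Rough lower bound on weight size: parameters * bits per weight / 8 bytes.
// `estimateWeightsMB` is an illustrative helper, not part of the SDK.
function estimateWeightsMB(params, bitsPerWeight) {
  const bytes = (params * bitsPerWeight) / 8;
  return bytes / 1e6; // decimal megabytes
}
```

For example, estimateWeightsMB(600e6, 4) gives 300, which sits inside the 200 to 400 MB range quoted for small int4 models, and the same model at int8 doubles to 600.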
Discover what is actually installable on the current runtime via intelligence.models.available(). New models ship over the air from Hugging Face and do not require an SDK upgrade.
Use cases
Privacy-Sensitive AI
Medical notes, personal journals, legal documents. Input never leaves the device. Naturally HIPAA and GDPR aligned.
Offline-First Experiences
Travel apps, field service tools, low-signal environments. After first download, everything works in airplane mode.
Low-Latency Interactions
Autocomplete, inline rewrites, tone adjusters. First token in tens of milliseconds with no network hop.
Cost-Bound Features
Per-token billing makes some features economically impossible. Free at any scale on-device.
Frequently asked questions
Does this work offline?
Yes. Once a model is downloaded, inference runs without any network connection. Works in airplane mode, on a plane, in a tunnel.
How large are the models?
Small models (350M to 700M parameters, int4) are typically 200 to 400 MB. Medium models (1 to 2B parameters, int4) are 600 MB to 1.2 GB. Prompt users to download on Wi-Fi.
Do downloads continue if the user closes the app?
Yes. Downloads continue in the background using native OS background transfer APIs (NSURLSession on iOS, WorkManager on Android). When the app reopens, the global downloadEnd event fires for any model that completed in the background.
Do I need to handle backgrounding for inference?
No. The SDK auto-resumes every active inference job when the user returns. Do not wire up visibilitychange, pagehide, or beforeunload for inference yourself.
Can I run multiple models at the same time?
Each inference job uses one model. Multiple concurrent jobs across the same or different models are supported. On devices with limited RAM, prefer sequential inference for large models.
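Sequential inference can be sketched by wrapping the callback-style call in a Promise and awaiting each job in turn. This assumes the handler is an object with stream and complete functions and that params is whatever intelligence.run() accepts. runOnce and runSequentially are hypothetical helpers, and the sketch has no error path because the text here does not describe one.

```javascript
// Sketch only: wraps the callback-style run() in a Promise.
// The { stream, complete } handler shape is an assumption.
function runOnce(intelligence, params, onStream) {
  return new Promise((resolve) => {
    intelligence.run(params, {
      stream: (text) => { if (onStream) onStream(text); },
      complete: (finalText) => resolve(finalText),
    });
  });
}

// Runs jobs one after another, so only one job is active at a time.
async function runSequentially(intelligence, jobs) {
  const results = [];
  for (const params of jobs) {
    results.push(await runOnce(intelligence, params));
  }
  return results;
}
```

On devices with plenty of RAM you could instead start the jobs together with Promise.all over runOnce calls.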
Is this App Store compliant?
Yes. Downloading model weights follows the same pattern as navigation apps downloading map tiles or audio apps downloading content. No native code is downloaded post-install.
Does this replace Apple Intelligence?
No. Apple Intelligence and Despia Local AI serve different purposes and can coexist in the same app. Despia Local AI runs on both iOS and Android, does not require iOS 26+, and does not require Apple Intelligence to be enabled by the user.
Resources
NPM Package
despia-intelligence
Reference
Full API: run, models, runtime, events
GitHub
Source on GitHub