Despia Local AI brings real on-device language model inference to your Despia app. Models load via the device’s native AI acceleration stack (Metal, Core ML, and the Apple Neural Engine on iOS; GPU, NNAPI, and the CPU fast-path on Android). Prompts never leave the device, inference is free at any scale, and every feature keeps working in airplane mode.

Installation

npm install despia-intelligence
import intelligence from 'despia-intelligence';

Why we built this

Most apps that ship an AI feature pipe every prompt to somebody else’s server. You pay per token. You wait for a network round-trip. You run a backend proxy to hide your API key. You renegotiate your privacy policy.

That is the right tradeoff for a ChatGPT clone. It is the wrong tradeoff for the AI features people actually want to ship: summarising a note, classifying a tag, rewriting a paragraph, extracting JSON from a form, drafting a reply suggestion.

We wanted on-device AI that works the way web developers expect: import a package, call a function, get tokens back. No API keys to guard. No backend proxy. No bill shock.

So we built despia-intelligence, a JavaScript bridge from your web code to the device’s native AI hardware. Models live in native memory. Inference runs on the Neural Engine, GPU, or NNAPI. Tokens stream back to your handler.

Cross-platform by design

Despia Local AI is built for both iOS and Android through a unified JavaScript SDK. Write your code once. It works identically on both platforms. No iOS-specific workarounds. No Android edge cases. The API is fully standardized so you can focus on building your app instead of handling platform differences.
import intelligence from 'despia-intelligence'

// Same code. Both platforms.
intelligence.run({
  type:   'text',
  model:  'qwen3-0.6b',
  prompt: 'Summarise this article.',
  stream: true,
}, {
  stream:   (chunk) => output.textContent = chunk,
  complete: (text)  => save(text),
})
Under the hood, Despia handles the platform-specific acceleration paths so your JavaScript stays clean and portable.

Fully on-device. No cloud dependencies.

Despia Local AI runs entirely on the device. There is no cloud component, no proprietary inference backend, no API gateway sitting between you and your users.

Private by default. Prompts, user text, and generated tokens never leave the device. Nothing to log, retain, or train on.

Use any model. Open-weights text models from Qwen, Liquid LFM, Google Gemma, and Tencent Youtu are downloaded on demand from Hugging Face into the Despia container. New models ship over the air without an SDK upgrade.

No per-token fees. Unlike cloud LLM APIs that charge per token, Despia Local AI has zero runtime costs. Run as many completions as the device can handle. Scale to millions of users without inference costs scaling with them.

Works offline completely. Once a model is downloaded, the entire system operates without any network connectivity. No license checks, no heartbeats, no server dependencies.

How it works

Gate calls behind intelligence.runtime.ok so the same code works in a desktop browser preview. Make sure the model is installed, then fire intelligence.run() with your params and a handler.
import intelligence from 'despia-intelligence'

// Skip inference when the native runtime is unavailable (e.g. a desktop browser preview).
if (intelligence.runtime.ok) {
  intelligence.run({
    type:   'text',
    model:  'qwen3-0.6b',
    prompt: 'Summarise this article in three sentences.',
    system: 'Be concise.',
    stream: true,
  }, {
    stream:   (chunk) => output.textContent = chunk, // full accumulated text
    complete: (text)  => save(text),
    error:    (err)   => console.error(err.code, err.message),
  })
}
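Because the stream handler is handed the full accumulated text rather than a delta, code that needs only the newly generated portion (for example, to drive a typewriter effect) has to diff consecutive snapshots itself. The following is a minimal sketch of that idea. `createDeltaTracker` is a hypothetical helper, not part of despia-intelligence, and it assumes each snapshot normally extends the previous one.

```javascript
// Hypothetical helper (not part of the SDK): converts "full accumulated text"
// snapshots into "newly added text" deltas.
function createDeltaTracker() {
  let previous = '';
  return function nextDelta(accumulated) {
    // Usual case: the new snapshot extends the previous one, so return the tail.
    // Otherwise the text restarted, so treat the whole snapshot as new.
    const delta = accumulated.startsWith(previous)
      ? accumulated.slice(previous.length)
      : accumulated;
    previous = accumulated;
    return delta;
  };
}

// Possible usage inside a stream handler (typeOut is a hypothetical function):
//   const nextDelta = createDeltaTracker();
//   stream: (text) => typeOut(nextDelta(text)),
```

For plain rendering, assigning the snapshot to `textContent` stays the simpler choice; a tracker like this is only worth it when the consumer cannot accept repeated text.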
The stream callback receives the full accumulated text on every tick, not a delta. Replace the DOM content rather than appending. The complete callback receives the final response string when generation finishes.

If the model is not yet on the device, download it first. Downloads continue in the background even when the app is closed, using NSURLSession on iOS and WorkManager on Android.
intelligence.models.download('qwen3-0.6b', {
  onStart:    ()    => showDownloadUI(),
  onProgress: (pct) => bar.style.width = pct + '%',
  onEnd:      ()    => hideDownloadUI(),
  onError:    (err) => showError(err),
})
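If the rest of your code is written with async/await, the callback-style download above can be wrapped in a Promise. This is a sketch under stated assumptions, not an SDK feature: `downloadModel` is a hypothetical name, the `models` argument is expected to be `intelligence.models`, and only the four callbacks shown above are used. Reporting 0 from `onStart` is an illustrative choice.

```javascript
// Hypothetical wrapper (not part of the SDK). Pass `intelligence.models` as `models`.
function downloadModel(models, modelId, onProgress = () => {}) {
  return new Promise((resolve, reject) => {
    models.download(modelId, {
      onStart:    ()    => onProgress(0),   // illustrative: report 0% when the download begins
      onProgress: (pct) => onProgress(pct), // forward the percentage unchanged
      onEnd:      ()    => resolve(modelId),
      onError:    (err) => reject(err),
    });
  });
}

// Possible usage:
//   await downloadModel(intelligence.models, 'qwen3-0.6b', (pct) => {
//     bar.style.width = pct + '%';
//   });
```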

Background and return

Inference sessions do not survive the WebView being suspended by the OS. The SDK handles this for you. When the user hits home, opens another app, and comes back, every in-flight job re-fires automatically with the same params and the same handler. Any number of concurrent jobs all resume. Just write your code as if backgrounding does not exist. No visibilitychange, no pagehide, no manual state to serialise.

Available models

Pick by size first, quality second. Smaller models load faster, use less RAM, and work on older devices. Larger models give higher quality at the cost of latency and memory.
| Model | Size | Family | Strengths | Tier |
| --- | --- | --- | --- | --- |
| lfm2.5-350m | 350M | Liquid LFM2.5 | Ultra-fast, tiny memory footprint | Any |
| qwen3-0.6b | 600M | Alibaba Qwen3 | Balanced default | Any |
| lfm2-700m | 700M | Liquid LFM2 | Low-latency general chat | Any |
| gemma-3-1b-it | 1B | Google Gemma 3 | Strong instruction tuning | Modern |
| lfm2.5-1.2b-instruct | 1.2B | Liquid LFM2.5 | Structured instructions | Modern |
| lfm2.5-1.2b-thinking | 1.2B | Liquid LFM2.5 | Chain-of-thought reasoning | Modern |
| qwen3-1.7b | 1.7B | Alibaba Qwen3 | Stronger reasoning | Modern |
| youtu-llm-2b | 2B | Tencent Youtu | Multilingual (CN/EN) | Modern |
| lfm2-2.6b | 2.6B | Liquid LFM2 | Higher-quality general chat | Modern to Flagship |
| gemma-3n-e4b-it | 4B effective | Google Gemma 3n | Mobile-optimised | Flagship |
| lfm2-8b-a1b | 8B MoE (1B active) | Liquid LFM2 | Highest quality on-device | Flagship |
All models are published in int4 (smaller, faster) and int8 (higher quality) quantizations. The runtime picks the quantization based on device capability. Small text models land around 200 to 400 MB (int4); medium text models 600 MB to 1.2 GB. Prompt users to download on Wi-Fi.
Discover what is actually installable on the current runtime via intelligence.models.available(). New models ship over the air from Hugging Face and do not require an SDK upgrade.
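One way to combine the table with runtime discovery is to keep an ordered preference list and take the first entry the device can install. The sketch below is illustrative only: `pickModel` and `PREFERENCE` are hypothetical names, and because the return shape of `intelligence.models.available()` is not described here, the function takes a plain array of model id strings that you would derive from that call yourself.

```javascript
// Illustrative preference order for one feature: most capable acceptable model first.
const PREFERENCE = ['qwen3-1.7b', 'gemma-3-1b-it', 'qwen3-0.6b', 'lfm2.5-350m'];

// Hypothetical helper (not part of the SDK).
// Returns the first preferred id that is installable, or null if none is.
function pickModel(preferred, installableIds) {
  const installable = new Set(installableIds);
  return preferred.find((id) => installable.has(id)) ?? null;
}

// Possible usage, assuming you have already mapped the result of
// intelligence.models.available() to an array of id strings:
//   const model = pickModel(PREFERENCE, installableIds);
```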

Use cases

Privacy-Sensitive AI

Medical notes, personal journals, legal documents. Input never leaves the device. Naturally HIPAA and GDPR aligned.

Offline-First Experiences

Travel apps, field service tools, low-signal environments. After first download, everything works in airplane mode.

Low-Latency Interactions

Autocomplete, inline rewrites, tone adjusters. First token in tens of milliseconds with no network hop.

Cost-Bound Features

Per-token billing makes some features economically impossible. On-device inference is free at any scale.

Frequently asked questions

Inference works offline. Once a model is downloaded, it runs without any network connection: in airplane mode, on a plane, in a tunnel.
Small models (350M to 700M parameters, int4) are typically 200 to 400 MB. Medium models (1 to 2B parameters, int4) are 600 MB to 1.2 GB. Prompt users to download on Wi-Fi.
Downloads continue in the background using native OS background transfer APIs (NSURLSession on iOS, WorkManager on Android). When the app reopens, the global downloadEnd event fires for any model that completed in the background.
You do not need to handle app backgrounding yourself. The SDK auto-resumes every active inference job when the user returns. Do not wire up visibilitychange, pagehide, or beforeunload for inference yourself.
Each inference job uses one model. Multiple concurrent jobs across the same or different models are supported. On devices with limited RAM, prefer sequential inference for large models.
Downloading model weights after install follows the same pattern as navigation apps downloading map tiles or audio apps downloading content. No native code is downloaded post-install.
Despia Local AI is separate from Apple Intelligence. The two serve different purposes and can coexist in the same app. Despia Local AI runs on both iOS and Android, does not require iOS 26+, and does not require Apple Intelligence to be enabled by the user.
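The advice above to prefer sequential inference for large models on devices with limited RAM can be followed with a small queue that starts each job only after the previous one has finished. This is a sketch, not an SDK feature: `runAsPromise` and `createSerialQueue` are hypothetical helpers, and the `intel` argument is expected to be the imported `intelligence` object with the `run(params, handler)` shape shown earlier.

```javascript
// Hypothetical helper: wraps the callback-style run() in a Promise.
function runAsPromise(intel, params, onStream = () => {}) {
  return new Promise((resolve, reject) => {
    intel.run(params, { stream: onStream, complete: resolve, error: reject });
  });
}

// Hypothetical helper: runs queued tasks strictly one after another.
function createSerialQueue() {
  let tail = Promise.resolve();
  return function enqueue(task) {
    const result = tail.then(() => task());
    tail = result.catch(() => {}); // a failed job must not block the jobs behind it
    return result;
  };
}

// Possible usage:
//   const enqueue = createSerialQueue();
//   const summary = enqueue(() => runAsPromise(intelligence, {
//     type: 'text', model: 'lfm2-2.6b', prompt: 'Summarise this note.', stream: true,
//   }));
```

Each call to `enqueue` returns its own Promise, so callers still receive their individual results or errors while the jobs themselves never overlap.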

Resources

NPM Package: despia-intelligence

Reference: Full API (run, models, runtime, events)

GitHub: Source on GitHub