How It Works
TinyWhale runs LLMs entirely on your device — across Chrome extensions, browsers, mobile apps, and desktop. No servers, no cloud, no data collection.
Load the Model
An open-source LLM, quantized to 4-bit, is downloaded directly into your browser — only ~500MB in total. The model files are cached locally, so subsequent visits load instantly.
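The ~500MB figure follows from simple arithmetic: weight count times bits per weight. A rough sketch (illustrative only — real model files add embedding tables, metadata, and quantization scales):

```typescript
// Rough size estimate for a quantized model: params × bits-per-weight / 8 bytes.
function approxModelSizeMB(paramCount: number, bitsPerWeight: number): number {
  const bytes = (paramCount * bitsPerWeight) / 8;
  return Math.round(bytes / 1e6);
}

// A ~1B-parameter model at 4-bit lands near 500 MB;
// the same model at fp16 would be roughly 4× larger.
approxModelSizeMB(1e9, 4);  // ≈ 500
approxModelSizeMB(1e9, 16); // ≈ 2000
```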
Chat Privately
All inference runs locally on your GPU via WebGPU. Your conversations never leave your device — there's no server, no API calls, no telemetry. Once loaded, it even works offline.
Customize & Explore
Fine-tune generation with temperature, top-p, top-k, and more. Upload images for vision tasks — our models support multimodal input. Experiment with different settings to get the best results.
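To make the knobs concrete, here is a minimal sketch of how top-k and top-p (nucleus) filtering interact over a token probability distribution. This is a hypothetical helper for illustration, not TinyWhale's actual sampling code:

```typescript
// Keep at most topK tokens, stopping earlier once cumulative probability
// reaches topP; zero out the rest and renormalize the survivors.
function filterTopKTopP(probs: number[], topK: number, topP: number): number[] {
  // Candidate indices sorted by probability, descending.
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const kept = new Set<number>();
  let cumulative = 0;
  for (const i of order.slice(0, topK)) {
    kept.add(i);
    cumulative += probs[i];
    if (cumulative >= topP) break; // nucleus cutoff
  }
  const mass = [...kept].reduce((s, i) => s + probs[i], 0);
  return probs.map((p, i) => (kept.has(i) ? p / mass : 0));
}
```

Lower temperature plus tighter top-k/top-p makes output more deterministic; looser settings increase variety.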
4 Platforms, One Monorepo
Same goal — on-device LLM inference — but each platform uses different engines and model formats optimized for its runtime.
Chrome Extension
Plasmo MV3
Run LLMs directly in a Chrome side panel. The service worker loads the model and streams tokens to the UI via chrome.runtime.sendMessage().
- Service worker requires build patches: import.meta.url rewriting, Node module stubs (fs, sharp, onnxruntime-node), and env.js patching for IS_BROWSER_ENV
- Auto-detects WebGPU FP16 support (shader-f16) and falls back to q4 quantization when unavailable
- Supports Firefox MV2 (sidebar_action) and Safari (Xcode conversion) from the same codebase
- Models include text generation (Llama, Qwen, Phi, SmolLM, DeepSeek-R1), vision (Qwen VL, Janus), and speech-to-text (Whisper)
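The FP16 fallback mentioned above boils down to a feature probe on the WebGPU adapter. A sketch of the idea — the function and dtype names here are illustrative, not TinyWhale's exact API:

```typescript
// Choose weight precision from the WebGPU adapter's feature set:
// "shader-f16" present → fp16 weights; otherwise fall back to 4-bit.
type Dtype = "fp16" | "q4";

function pickDtype(adapterFeatures: ReadonlySet<string>): Dtype {
  return adapterFeatures.has("shader-f16") ? "fp16" : "q4";
}

// In the extension this would run against a real adapter:
//   const adapter = await navigator.gpu?.requestAdapter();
//   const dtype = pickDtype(adapter?.features ?? new Set());
```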
Browser
Next.js
In-browser AI chat using a Web Worker to keep the main thread responsive. The LLMPipeline singleton loads the model once and handles both text-only and vision (image+text) inference.
- Uses AutoModelForImageTextToText + AutoProcessor for unified text and vision inference in a single model
- TextStreamer provides real-time token streaming with TPS (tokens/sec) and TTFT (time to first token) metrics
- InterruptableStoppingCriteria allows users to stop generation mid-stream
- No build compatibility hacks needed — Web Workers support standard ES modules natively
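The TPS and TTFT metrics above can be derived entirely from streaming callback timestamps. A minimal sketch — the metrics shape is an assumption, not taken from TinyWhale's source:

```typescript
// Time-to-first-token (TTFT) and decode tokens-per-second (TPS),
// computed from the request start time and per-token arrival times.
interface StreamMetrics {
  ttftMs: number; // delay before the first token arrived
  tps: number;    // decode throughput after the first token
}

function computeMetrics(startMs: number, tokenTimesMs: number[]): StreamMetrics {
  const first = tokenTimesMs[0];
  const last = tokenTimesMs[tokenTimesMs.length - 1];
  const decodeSeconds = (last - first) / 1000;
  return {
    ttftMs: first - startMs,
    // (n − 1) intervals elapse between n tokens.
    tps: decodeSeconds > 0 ? (tokenTimesMs.length - 1) / decodeSeconds : 0,
  };
}
```

In practice the Web Worker posts these numbers back to the main thread alongside each streamed token, so the UI can show live throughput.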
Mobile
Expo + llama.rn
Native mobile inference using llama.rn, a React Native binding to llama.cpp via JSI (JavaScript Interface). GGUF models run with Metal GPU acceleration on iOS.
- JSI bindings bypass the React Native Bridge for near-native performance — direct C++ to JavaScript calls
- iOS uses Metal with up to 99 GPU layers; Android supports CPU and optional OpenCL GPU offloading
- All llama.cpp symbols are prefixed with lm_ to prevent namespace conflicts with other native libraries
- Supports multimodal (vision/audio), embeddings, grammar sampling (GBNF/JSON schema), and tool calling
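The per-platform GPU offload defaults described above can be captured in a small helper. A hypothetical sketch — the function name is invented, and the value would feed llama.rn's GPU-layer context option (exact option name may differ):

```typescript
// iOS offloads up to 99 layers to Metal; Android defaults to CPU-only
// unless OpenCL offload has been explicitly enabled.
function gpuLayersFor(platform: "ios" | "android", openclEnabled = false): number {
  if (platform === "ios") return 99;   // Metal: offload everything
  return openclEnabled ? 99 : 0;       // Android: CPU unless OpenCL opted in
}
```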
Desktop
Tauri + Rust
Lightweight desktop app with a Rust backend that calls llama.cpp directly via the llama-cpp-2 crate. Models are auto-downloaded from Hugging Face Hub with real-time progress tracking.
- Tauri uses the system WebView instead of bundling Chromium — dramatically smaller app size vs Electron
- Rust backend manages model lifecycle with Arc<Mutex<ModelStore>> for thread-safe concurrent access
- Hugging Face Hub integration auto-discovers GGUF files and prefers q4_k_m quantization
- Download progress events are emitted via Tauri IPC to the React frontend in real-time
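The "prefers q4_k_m" behavior amounts to ranking a repo's GGUF files against a preference list. A sketch of that idea in TypeScript — the preference order below is an assumption, not TinyWhale's exact ranking:

```typescript
// Pick the best GGUF file from a Hugging Face repo listing:
// q4_k_m first, then fall back through other common quantizations.
const QUANT_PREFERENCE = ["q4_k_m", "q4_0", "q5_k_m", "q8_0"];

function pickGguf(files: string[]): string | undefined {
  const ggufs = files.filter((f) => f.toLowerCase().endsWith(".gguf"));
  for (const quant of QUANT_PREFERENCE) {
    const match = ggufs.find((f) => f.toLowerCase().includes(quant));
    if (match) return match;
  }
  return ggufs[0]; // no known quant tag: take whatever is there
}
```

q4_k_m is a common sweet spot for desktop hardware: roughly 4.5 bits per weight with noticeably better quality than plain q4_0.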
Platform Comparison
| | Chrome Extension | Browser | Mobile | Desktop |
|---|---|---|---|---|
| Inference Engine | Transformers.js | Transformers.js | llama.cpp (JSI) | llama.cpp (Rust) |
| Model Format | ONNX | ONNX | GGUF | GGUF |
| GPU Acceleration | WebGPU | WebGPU | Metal / OpenCL | CPU + GPU |
| Execution Context | Service Worker | Web Worker | Native Thread | Rust Thread |
| Installation | Chrome Web Store | None (URL) | App Store | Download |
Technology Stack
Transformers.js
Hugging Face's library for running ML models in the browser. Powers the Chrome extension and Next.js web inference with ONNX models.
llama.cpp
High-performance C++ inference engine for GGUF models. Used on mobile (via llama.rn JSI bindings) and desktop (via Rust llama-cpp-2 crate).
WebGPU
Next-generation GPU API for the web, enabling high-performance computation directly on your graphics hardware in browsers.
ONNX Runtime Web
Microsoft's cross-platform inference engine, optimized for WebGPU and WASM execution in browser environments.
Tauri
Build lightweight desktop apps with a Rust backend and system WebView. No bundled Chromium — dramatically smaller than Electron.
Expo + React Native
Cross-platform mobile framework. Combined with llama.rn's JSI bindings for near-native LLM inference on iOS and Android.
Ready to try it?
Start chatting with AI directly in your browser. No sign-up required, no data collected, completely free.