How It Works

TinyWhale runs LLMs entirely on your device, whether as a Chrome extension, in the browser, on mobile, or on desktop. No servers, no cloud, no data collection.

Step 01

Load the Model

An open-source LLM is downloaded directly into your browser; 4-bit quantization keeps the entire model at roughly 500 MB. The model files are cached locally, so subsequent visits load instantly.
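Why 4-bit quantization lands near that size: each weight takes half a byte, so the weight bytes are roughly `params × 0.5`. A rough sketch (the helper is illustrative, not TinyWhale's code; real ONNX exports also ship tokenizer files, per-group scale metadata, and some layers kept at higher precision):

```typescript
// Back-of-the-envelope size estimate for a quantized model.
// Illustration only: real downloads include tokenizer files and
// mixed-precision layers, which add overhead on top of this.
function quantizedSizeMB(paramsBillions: number, bitsPerWeight: number): number {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return Math.round(bytes / 1e6);
}

// A ~0.8B-parameter model at 4 bits is ~400 MB of raw weights;
// metadata and higher-precision layers push the download toward ~500 MB.
const weightsOnly = quantizedSizeMB(0.8, 4);
```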

Step 02

Chat Privately

All inference runs locally on your GPU via WebGPU. Your conversations never leave your device — there's no server, no API calls, no telemetry. Once loaded, it even works offline.

Step 03

Customize & Explore

Tune generation with temperature, top-p, top-k, and more. Upload images for vision tasks — our models support multimodal input. Experiment with different settings to find what works best for your task.
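For intuition on what those knobs do, here is the standard temperature / top-k / top-p (nucleus) filtering algorithm as a minimal sketch — this is the textbook procedure, not Transformers.js internals:

```typescript
// Shape a token distribution with temperature, top-k, and top-p.
// Returns renormalized probabilities; filtered-out tokens get 0.
function sampleDistribution(
  logits: number[],
  temperature: number,
  topK: number,
  topP: number,
): number[] {
  // Temperature scaling: <1 sharpens the distribution, >1 flattens it.
  const scaled = logits.map((l) => l / temperature);
  const maxL = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - maxL)); // stable softmax
  const sum = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / sum);

  // Top-k: keep only the k most likely tokens.
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const kept = new Set(order.slice(0, topK));

  // Top-p (nucleus): trim further to the smallest prefix of the sorted
  // list whose cumulative probability reaches topP.
  let mass = 0;
  for (const i of order) {
    if (!kept.has(i)) break;
    mass += probs[i];
    if (mass >= topP) {
      for (const j of order.slice(order.indexOf(i) + 1)) kept.delete(j);
      break;
    }
  }

  // Renormalize the survivors so they sum to 1.
  const keptSum = order
    .filter((i) => kept.has(i))
    .reduce((a, i) => a + probs[i], 0);
  return probs.map((p, i) => (kept.has(i) ? p / keptSum : 0));
}
```

Lower temperature plus a small top-k makes output deterministic and focused; higher values make it more varied.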

4 Platforms, One Monorepo

Same goal — on-device LLM inference — but each platform uses different engines and model formats optimized for its runtime.

Chrome Extension

Plasmo MV3

Run LLMs directly in a Chrome side panel. The service worker loads the model and streams tokens to the UI via chrome.runtime.sendMessage().

Inference Engine: Transformers.js
Model Format: ONNX (q4f16)
GPU Acceleration: WebGPU
Execution Context: Service Worker
Default Model: Qwen3.5-0.8B
  • Service worker requires build patches: import.meta.url rewriting, Node module stubs (fs, sharp, onnxruntime-node), and env.js patching for IS_BROWSER_ENV
  • Auto-detects WebGPU FP16 support (shader-f16) and falls back to q4 quantization when unavailable
  • Supports Firefox MV2 (sidebar_action) and Safari (Xcode conversion) from the same codebase
  • Models include text generation (Llama, Qwen, Phi, SmolLM, DeepSeek-R1), vision (Qwen VL, Janus), and speech-to-text (Whisper)
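The fp16 auto-detection mentioned above can be sketched like this. The dtype strings match the ONNX variants named in this doc, and `'shader-f16'` is the real WebGPU feature flag for 16-bit float shaders, but the function names are illustrative, not TinyWhale's actual API:

```typescript
// Decide which quantized ONNX variant to load based on GPU features.
function pickDtype(gpuFeatures: ReadonlySet<string>): "q4f16" | "q4" {
  // 'shader-f16' signals the adapter can run fp16 shader code.
  return gpuFeatures.has("shader-f16") ? "q4f16" : "q4";
}

// In the service worker this would be driven by the real adapter:
async function detectDtype(): Promise<"q4f16" | "q4"> {
  const adapter = await (globalThis as any).navigator?.gpu?.requestAdapter();
  return pickDtype(adapter?.features ?? new Set<string>());
}
```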

Browser

Next.js

In-browser AI chat using a Web Worker to keep the main thread responsive. The LLMPipeline singleton loads the model once and handles both text-only and vision (image+text) inference.

Inference Engine: Transformers.js
Model Format: ONNX (q4f16)
GPU Acceleration: WebGPU (Web Worker)
Execution Context: Web Worker
Default Model: Qwen3.5-0.8B
  • Uses AutoModelForImageTextToText + AutoProcessor for unified text and vision inference in a single model
  • TextStreamer provides real-time token streaming with TPS (tokens/sec) and TTFT (time to first token) metrics
  • InterruptableStoppingCriteria allows users to stop generation mid-stream
  • No build compatibility hacks needed — Web Workers support standard ES modules natively
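The TPS and TTFT metrics above reduce to simple timestamp bookkeeping around the streamer's token callback. A small sketch with an injectable clock (the class name and shape are illustrative):

```typescript
// Track time-to-first-token and tokens-per-second for a token stream.
// The clock is injected so the logic is testable without real timers.
class StreamMetrics {
  private start: number;
  private firstToken: number | null = null;
  private tokens = 0;

  constructor(private now: () => number = () => performance.now()) {
    this.start = this.now();
  }

  // Call once per token emitted by the streamer callback.
  onToken(): void {
    if (this.firstToken === null) this.firstToken = this.now();
    this.tokens += 1;
  }

  /** Time to first token, in ms (null until the first token arrives). */
  ttft(): number | null {
    return this.firstToken === null ? null : this.firstToken - this.start;
  }

  /** Tokens per second over the whole stream so far. */
  tps(): number {
    const elapsedS = (this.now() - this.start) / 1000;
    return elapsedS > 0 ? this.tokens / elapsedS : 0;
  }
}
```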

Mobile

Expo + llama.rn

Native mobile inference using llama.rn, a React Native binding to llama.cpp via JSI (JavaScript Interface). GGUF models run with Metal GPU acceleration on iOS.

Inference Engine: llama.cpp (JSI)
Model Format: GGUF
GPU Acceleration: Metal (iOS) / OpenCL (Android)
Execution Context: Native Thread
Framework: Expo SDK 54
  • JSI bindings bypass the React Native Bridge for near-native performance — direct C++ to JavaScript calls
  • iOS uses Metal with up to 99 GPU layers; Android supports CPU and optional OpenCL GPU offloading
  • All llama.cpp symbols are prefixed with lm_ to prevent namespace conflicts with other native libraries
  • Supports multimodal (vision/audio), embeddings, grammar sampling (GBNF/JSON schema), and tool calling
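The per-platform GPU split above comes down to the init options passed to llama.rn. A hedged sketch in the shape llama.rn accepts (`model`, `n_ctx`, `n_gpu_layers`); the helper itself and the context size are assumptions, not TinyWhale's code:

```typescript
// Build llama.rn-style init options per platform.
type LlamaInit = { model: string; n_ctx: number; n_gpu_layers: number };

function initOptions(platform: "ios" | "android", modelPath: string): LlamaInit {
  return {
    model: modelPath,
    n_ctx: 2048, // assumed context size for this sketch
    // iOS: offload up to 99 layers to Metal, as noted above.
    // Android: default to CPU; OpenCL offload is opt-in.
    n_gpu_layers: platform === "ios" ? 99 : 0,
  };
}

// Usage with llama.rn would look roughly like:
//   const ctx = await initLlama(initOptions("ios", modelUri));
//   const res = await ctx.completion({ prompt, temperature: 0.7 });
```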

Desktop

Tauri + Rust

Lightweight desktop app with a Rust backend that calls llama.cpp directly via the llama-cpp-2 crate. Models are auto-downloaded from Hugging Face Hub with real-time progress tracking.

Inference Engine: llama-cpp-2 (Rust)
Model Format: GGUF
GPU Acceleration: CPU + optional GPU
Execution Context: Rust Thread
Frontend: Vite + React
  • Tauri uses the system WebView instead of bundling Chromium — dramatically smaller app size vs Electron
  • Rust backend manages model lifecycle with Arc<Mutex<ModelStore>> for thread-safe concurrent access
  • Hugging Face Hub integration auto-discovers GGUF files and prefers q4_k_m quantization
  • Download progress events are emitted via Tauri IPC to the React frontend in real-time
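The "prefers q4_k_m" selection above is a simple filename filter over the repo listing. A sketch of that logic (the preference ordering beyond q4_k_m is an assumption, and the actual backend does this in Rust):

```typescript
// Pick a GGUF file from a Hugging Face repo listing, preferring
// q4_k_m, then falling back down an (assumed) preference list.
const QUANT_PREFERENCE = ["q4_k_m", "q4_0", "q5_k_m", "q8_0"];

function pickGguf(files: string[]): string | undefined {
  const ggufs = files.filter((f) => f.toLowerCase().endsWith(".gguf"));
  for (const quant of QUANT_PREFERENCE) {
    const hit = ggufs.find((f) => f.toLowerCase().includes(quant));
    if (hit) return hit;
  }
  return ggufs[0]; // any GGUF beats none
}
```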

Platform Comparison

|                   | Chrome Extension | Browser         | Mobile          | Desktop          |
|-------------------|------------------|-----------------|-----------------|------------------|
| Inference Engine  | Transformers.js  | Transformers.js | llama.cpp (JSI) | llama.cpp (Rust) |
| Model Format      | ONNX             | ONNX            | GGUF            | GGUF             |
| GPU Acceleration  | WebGPU           | WebGPU          | Metal / OpenCL  | CPU + GPU        |
| Execution Context | Service Worker   | Web Worker      | Native Thread   | Rust Thread      |
| Installation      | Chrome Web Store | None (URL)      | App Store       | Download         |

Ready to try it?

Start chatting with AI directly in your browser. No sign-up required, no data collected, completely free.