How It Works
TinyWhale runs LLMs entirely on your device — across Chrome extensions, browsers, mobile apps, and desktop. No servers, no cloud, no data collection.
Load the Model
An open-source LLM, quantized to 4-bit, is downloaded directly into your browser — only ~500MB in total. The model files are cached locally, so subsequent visits load instantly.
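The ~500MB figure follows from simple arithmetic: weight count times bits per weight. A rough sketch (illustrative only — real model files add embedding tables, metadata, and quantization scales):

```typescript
// Rough size estimate for a quantized model: params × bits-per-weight / 8 bytes.
function approxModelSizeMB(paramCount: number, bitsPerWeight: number): number {
  const bytes = (paramCount * bitsPerWeight) / 8;
  return Math.round(bytes / 1e6);
}

// A ~1B-parameter model at 4-bit lands near 500 MB;
// the same model at fp16 would be roughly 4× larger.
approxModelSizeMB(1e9, 4);  // ≈ 500
approxModelSizeMB(1e9, 16); // ≈ 2000
```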
Chat Privately
All inference runs locally on your GPU via WebGPU. Your conversations never leave your device — there's no server, no API calls, no telemetry. Once loaded, it even works offline.
Customize & Explore
Fine-tune generation with temperature, top-p, top-k, and more. Upload images for vision tasks — our models support multimodal input. Experiment with different settings to get the best results.
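To make the knobs concrete, here is a minimal sketch of how top-k and top-p (nucleus) filtering interact over a token probability distribution. This is a hypothetical helper for illustration, not TinyWhale's actual sampling code:

```typescript
// Keep at most topK tokens, stopping earlier once cumulative probability
// reaches topP; zero out the rest and renormalize the survivors.
function filterTopKTopP(probs: number[], topK: number, topP: number): number[] {
  // Candidate indices sorted by probability, descending.
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const kept = new Set<number>();
  let cumulative = 0;
  for (const i of order.slice(0, topK)) {
    kept.add(i);
    cumulative += probs[i];
    if (cumulative >= topP) break; // nucleus cutoff
  }
  const mass = [...kept].reduce((s, i) => s + probs[i], 0);
  return probs.map((p, i) => (kept.has(i) ? p / mass : 0));
}
```

Lower temperature plus tighter top-k/top-p makes output more deterministic; looser settings increase variety.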
4 Platforms, One Monorepo
Same goal — on-device LLM inference — but each platform uses different engines and model formats optimized for its runtime.
Chrome Extension
Plasmo MV3
Run LLMs directly in a Chrome side panel. The service worker loads the model and streams tokens to the UI via chrome.runtime.sendMessage().
- Service worker requires build patches: import.meta.url rewriting, Node module stubs (fs, sharp, onnxruntime-node), and env.js patching for IS_BROWSER_ENV
- Auto-detects WebGPU FP16 support (shader-f16) and falls back to q4 quantization when unavailable
- Supports Firefox MV2 (sidebar_action) and Safari (Xcode conversion) from the same codebase
- Models include text generation (Llama, Qwen, Phi, SmolLM, DeepSeek-R1), vision (Qwen VL, Janus), and speech-to-text (Whisper)
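The FP16 fallback mentioned above boils down to a feature probe on the WebGPU adapter. A sketch of the idea — the function and dtype names here are illustrative, not TinyWhale's exact API:

```typescript
// Choose weight precision from the WebGPU adapter's feature set:
// "shader-f16" present → fp16 weights; otherwise fall back to 4-bit.
type Dtype = "fp16" | "q4";

function pickDtype(adapterFeatures: ReadonlySet<string>): Dtype {
  return adapterFeatures.has("shader-f16") ? "fp16" : "q4";
}

// In the extension this would run against a real adapter:
//   const adapter = await navigator.gpu?.requestAdapter();
//   const dtype = pickDtype(adapter?.features ?? new Set());
```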
Browser
Next.js
In-browser AI chat using a Web Worker to keep the main thread responsive. The LLMPipeline singleton loads the model once and handles both text-only and vision (image+text) inference.
- Uses AutoModelForImageTextToText + AutoProcessor for unified text and vision inference in a single model
- TextStreamer provides real-time token streaming with TPS (tokens/sec) and TTFT (time to first token) metrics
- InterruptableStoppingCriteria allows users to stop generation mid-stream
- No build compatibility hacks needed — Web Workers support standard ES modules natively
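The TPS and TTFT metrics above can be derived entirely from streaming callback timestamps. A minimal sketch — the metrics shape is an assumption, not taken from TinyWhale's source:

```typescript
// Time-to-first-token (TTFT) and decode tokens-per-second (TPS),
// computed from the request start time and per-token arrival times.
interface StreamMetrics {
  ttftMs: number; // delay before the first token arrived
  tps: number;    // decode throughput after the first token
}

function computeMetrics(startMs: number, tokenTimesMs: number[]): StreamMetrics {
  const first = tokenTimesMs[0];
  const last = tokenTimesMs[tokenTimesMs.length - 1];
  const decodeSeconds = (last - first) / 1000;
  return {
    ttftMs: first - startMs,
    // (n − 1) intervals elapse between n tokens.
    tps: decodeSeconds > 0 ? (tokenTimesMs.length - 1) / decodeSeconds : 0,
  };
}
```

In practice the Web Worker posts these numbers back to the main thread alongside each streamed token, so the UI can show live throughput.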
Mobile
Expo + llama.rn
Native mobile inference using llama.rn, a React Native binding to llama.cpp via JSI (JavaScript Interface). GGUF models run with Metal GPU acceleration on iOS.
- JSI bindings bypass the React Native Bridge for near-native performance — direct C++ to JavaScript calls
- iOS uses Metal with up to 99 GPU layers; Android supports CPU and optional OpenCL GPU offloading
- All llama.cpp symbols are prefixed with lm_ to prevent namespace conflicts with other native libraries
- Supports multimodal (vision/audio), embeddings, grammar sampling (GBNF/JSON schema), and tool calling
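The per-platform GPU offload defaults described above can be captured in a small helper. A hypothetical sketch — the function name is invented, and the value would feed llama.rn's GPU-layer context option (exact option name may differ):

```typescript
// iOS offloads up to 99 layers to Metal; Android defaults to CPU-only
// unless OpenCL offload has been explicitly enabled.
function gpuLayersFor(platform: "ios" | "android", openclEnabled = false): number {
  if (platform === "ios") return 99;   // Metal: offload everything
  return openclEnabled ? 99 : 0;       // Android: CPU unless OpenCL opted in
}
```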
Desktop
Tauri + Rust
Lightweight desktop app with a Rust backend that calls llama.cpp directly via the llama-cpp-2 crate. Models are auto-downloaded from Hugging Face Hub with real-time progress tracking.
- Tauri uses the system WebView instead of bundling Chromium — dramatically smaller app size vs Electron
- Rust backend manages model lifecycle with Arc<Mutex<ModelStore>> for thread-safe concurrent access
- Hugging Face Hub integration auto-discovers GGUF files and prefers q4_k_m quantization
- Download progress events are emitted via Tauri IPC to the React frontend in real-time
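The "prefers q4_k_m" behavior amounts to ranking a repo's GGUF files against a preference list. A sketch of that idea in TypeScript — the preference order below is an assumption, not TinyWhale's exact ranking:

```typescript
// Pick the best GGUF file from a Hugging Face repo listing:
// q4_k_m first, then fall back through other common quantizations.
const QUANT_PREFERENCE = ["q4_k_m", "q4_0", "q5_k_m", "q8_0"];

function pickGguf(files: string[]): string | undefined {
  const ggufs = files.filter((f) => f.toLowerCase().endsWith(".gguf"));
  for (const quant of QUANT_PREFERENCE) {
    const match = ggufs.find((f) => f.toLowerCase().includes(quant));
    if (match) return match;
  }
  return ggufs[0]; // no known quant tag: take whatever is there
}
```

q4_k_m is a common sweet spot for desktop hardware: roughly 4.5 bits per weight with noticeably better quality than plain q4_0.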
Platform Comparison
| | Chrome Extension | Browser | Mobile | Desktop |
|---|---|---|---|---|
| Inference Engine | Transformers.js | Transformers.js | llama.cpp (JSI) | llama.cpp (Rust) |
| Model Format | ONNX | ONNX | GGUF | GGUF |
| GPU Acceleration | WebGPU | WebGPU | Metal / OpenCL | CPU + GPU |
| Execution Context | Service Worker | Web Worker | Native Thread | Rust Thread |
| Installation | Chrome Web Store | None (URL) | App Store | Download |
Technology Stack
Transformers.js
Hugging Face's library for running ML models in the browser. Powers the Chrome extension and Next.js web inference with ONNX models.
llama.cpp
High-performance C++ inference engine for GGUF models. Used on mobile (via llama.rn JSI bindings) and desktop (via Rust llama-cpp-2 crate).
WebGPU
Next-generation GPU API for the web, enabling high-performance computation directly on your graphics hardware in browsers.
ONNX Runtime Web
Microsoft's cross-platform inference engine, optimized for WebGPU and WASM execution in browser environments.
Tauri
Build lightweight desktop apps with a Rust backend and system WebView. No bundled Chromium — dramatically smaller than Electron.
Expo + React Native
Cross-platform mobile framework. Combined with llama.rn's JSI bindings for near-native LLM inference on iOS and Android.
Ready to try it?
Start chatting with AI directly in your browser. No sign-up required, no data collected, completely free.