@hsu_steve Here’s a live demo I made for tinygrad of llama-3.2-1B running in the browser, including on newer phones like the iPhone 15: https://t.co/xQoO9aq6mL
You can force usage of the WASM backend (even if WebGPU is enabled) using this link: https://t.co/v6GKem3KUa
The model uses 1.2 GB of memory. The llama-3.2-1B weights were quantized to int8 (with float32 scales) using tinygrad.
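For anyone curious what "int8 with float32 scales" means in practice, here is a minimal sketch in plain Python. It is not tinygrad's code and not necessarily the scheme the demo uses: it assumes symmetric quantization with one scale per row, and the names `to_float32`, `quantize_row`, and `dequantize_row` are made up for illustration. Each weight is stored as one signed byte instead of four bytes of float32, plus one float32 scale per row to map it back.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def quantize_row(row):
    """Symmetric int8 quantization of one row with a single float32 scale.

    Assumption: one scale per row. The real demo may group weights differently.
    """
    max_abs = max(abs(v) for v in row)
    # Map the largest magnitude in the row to 127; avoid a zero scale.
    scale = to_float32(max_abs / 127.0) if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp into the int8 range used here.
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights from int8 values and their scale."""
    return [v * scale for v in q]


# Toy weights, invented for this example.
weights = [
    [0.5, -1.0, 0.25, 0.0],
    [0.02, 0.01, -0.03, 0.015],
]

quantized = [quantize_row(row) for row in weights]
restored = [dequantize_row(q, scale) for q, scale in quantized]

for original, (q, scale), approx in zip(weights, quantized, restored):
    print(original, "->", q, "scale", scale, "->", approx)
```

Using a separate scale per row keeps small-valued rows from being crushed by a large value elsewhere in the matrix. The round-trip error for each weight stays within about half a scale step.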
Check out tinychat, a browser LLM app built with @__tinygrad__, which runs llama-3.2-1B locally on both WebGPU and WASM, including on newer phones such as iPhone 15. 🔗👇 🧵
For best performance, try https://t.co/xQoO9apyxd on a PC/laptop with WebGPU enabled. If WebGPU isn't enabled (or your device doesn't support large enough WebGPU buffers), then the app will automatically fall back to using WASM, which still works but is slower.
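The backend choice described above can be sketched as a small decision function. This is a guess at the logic, written in Python for readability. The real app runs in browser JavaScript, and `GpuInfo`, `choose_backend`, and the byte counts below are hypothetical names and numbers, not taken from tinychat. WebGPU does report a `maxBufferSize` limit, which is the kind of value the buffer check would compare against.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GpuInfo:
    """Hypothetical summary of what the browser reports for WebGPU."""
    max_buffer_size: int  # largest single buffer the device allows, in bytes


def choose_backend(
    gpu: Optional[GpuInfo],
    required_buffer_size: int,
    force_wasm: bool = False,
) -> str:
    """Pick "webgpu" when usable, otherwise fall back to "wasm"."""
    if force_wasm:
        # The user asked for WASM explicitly, even if WebGPU is enabled.
        return "wasm"
    if gpu is None:
        # WebGPU is not enabled or not supported in this browser.
        return "wasm"
    if gpu.max_buffer_size < required_buffer_size:
        # The device cannot hold the largest buffer the model needs.
        return "wasm"
    return "webgpu"


# Illustrative numbers only; the app's real buffer requirement is not stated.
needed = 512 * 1024 * 1024
print(choose_backend(GpuInfo(max_buffer_size=2 * 1024**3), needed))
print(choose_backend(GpuInfo(max_buffer_size=256 * 1024**2), needed))
print(choose_backend(None, needed))
print(choose_backend(GpuInfo(max_buffer_size=2 * 1024**3), needed, force_wasm=True))
```

Checking the buffer limit up front, rather than trying WebGPU and catching a failure later, lets the app pick the slower but working path before any weights are loaded.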