before all of our devices have dedicated inference accelerators, the path to bringing ~GPT-5 capability and experience to edge devices might just be a mix of low-bit weights, KV cache quantization, speculation, and some small variations on current architectures
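A rough back-of-the-envelope sketch of why low-bit weights and KV cache quantization matter on memory-constrained devices. The model shape below (8B parameters, 32 layers, 8 KV heads, head dimension 128, 8192-token context) is purely hypothetical and not taken from any specific model; the formulas are just the standard size arithmetic and ignore quantization metadata such as scales and zero points.

```python
def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Memory needed to store the weights at a given bit width."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: float) -> float:
    """Memory for the KV cache of one sequence (factor 2 = keys + values)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits_per_value / 8


# Hypothetical model shape, for illustration only.
N_PARAMS = 8_000_000_000
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)

GIB = 1024 ** 3
for bits in (16, 8, 4):
    w = weight_bytes(N_PARAMS, bits) / GIB
    kv = kv_cache_bytes(bits_per_value=bits, **SHAPE) / GIB
    print(f"{bits:>2}-bit: weights {w:.2f} GiB, KV cache {kv:.2f} GiB")
```

Under these assumptions both terms scale linearly with bit width, so going from 16-bit to 4-bit cuts each by 4x. Speculation is a separate lever (latency rather than memory) and is not modeled here.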
I am betting big time on GLM 6
There are many recent papers with great pre-training optimizations (e.g. DSv4)
Now, if Zhipu uses some of that (+ their own novel research) and tops it off with their current post-training regime, we're looking at an amazing SOTA in the making
The hardest problems are rarely solved by adding more complexity to the solution -- they are solved by reframing the question until a simpler, clearer answer reveals itself.
I gave Fable 5 one job: write custom WebGPU kernels for Gemma 4 inference.
It climbed to 84 tok/s, then hit a wall, insisting further optimization was impossible.
Hours later, Anthropic rolled back invisible LLM development safeguards, and it hit 255 tok/s.
The next day, access to Fable 5 was suspended globally.
For WWDC I was hoping for realtime Siri AI. The UX pattern of holding a button or typing (who types in 2026??) and asking it to do stuff feels like a dying paradigm, akin to the floppy disk era of the Mac.
I wanted a canvas to take over and go full immersion with voice and personalization. I didn’t want a running chatbot list view on my phone. I wanted the ability to use the SDK to keep user privacy in mind but build unique experiences across the OS.
I want the concept of “apps” to die and for developers to focus on making things magical for users. We are entering a new era, where in the last half of this decade humanity will solve the most complex problems faster than ever. Things should feel like they are part of a bigger system rather than “download my app to fix cancer.”
OpenAI GPT Realtime 2 is incredibly fast and extremely capable right now. I think it’s only a matter of time before people realize where the puck is going and push humanity in a new direction for our era of high intelligence.
for single-tenant use cases, instead of multiple (different!) experts being used for each generated token, which in practice requires all experts to be loaded, it would be nice if we just needed a few experts for a whole response.
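A toy sketch of the difference being described, assuming a standard top-k MoE router. The router scores here are random numbers rather than output from a real model, and `experts_touched_per_response` (pick one fixed set of experts from the mean score over the whole response) is a hypothetical scheme for illustration, not an existing architecture.

```python
import random


def top_k(scores, k):
    """Indices of the k highest-scoring experts."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def experts_touched_per_token(router_scores, k):
    """Usual MoE routing: every token picks its own top-k experts."""
    touched = set()
    for scores in router_scores:
        touched.update(top_k(scores, k))
    return touched


def experts_touched_per_response(router_scores, k):
    """Hypothetical routing: one top-k set, chosen from the mean score, for all tokens."""
    n_experts = len(router_scores[0])
    mean = [sum(s[e] for s in router_scores) / len(router_scores)
            for e in range(n_experts)]
    return set(top_k(mean, k))


# Random stand-in for router scores: 256 tokens, 64 experts, top-4 routing.
rng = random.Random(0)
n_tokens, n_experts, k = 256, 64, 4
router_scores = [[rng.gauss(0.0, 1.0) for _ in range(n_experts)]
                 for _ in range(n_tokens)]

per_token = experts_touched_per_token(router_scores, k)
per_response = experts_touched_per_response(router_scores, k)
print(f"per-token routing touches {len(per_token)} of {n_experts} experts")
print(f"per-response routing touches {len(per_response)} of {n_experts} experts")
```

With per-token routing the union of selected experts grows toward the full set as the response gets longer, which is why everything ends up resident in memory; a per-response choice would cap it at k. Whether quality survives that constraint is the open question, and this sketch says nothing about it.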