A few thoughts on OpenAI's Jalapeño chip announcement today:
1. This is most likely the first chip developed almost entirely by Codex/GPT. Codex, running whatever internal coding model they have (GPT 5.6, 6.0, whatever), wrote the entire software stack and most likely the hardware design too
2. OpenAI will write all of their inference serving directly in the Jalapeño ISA (instruction set architecture). Why? They only need to get, say, a few production models serving on Jalapeño, so they can have Codex hand-write each model in pure ISA to get very high performance
3. They are most likely running Codex/GPT in custom RL environments to teach the models to program the Jalapeño chip directly at the ISA level
4. This is a massive cost saving for OpenAI, and IMO it is only possible because of the breakthroughs in agentic coding. An AI company with frontier coding models can now become a hardware vendor with just a small team of experienced SWEs and an effectively unlimited supply of tokens
This is the first chip program fully accelerated by frontier AI.
16 parallel runs of Gemma 4 26B A4B on a single NVIDIA DGX Spark!
Each instance pushes about 18 tok/s, for roughly 300 tok/s in aggregate. It can even hit 32 parallel runs.
This level of concurrency highlights how efficient the architecture is.
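As a quick sanity check on those figures, the aggregate is just the per-instance rate times the instance count. A minimal sketch, assuming every instance sustains the same rate (the function name is illustrative, not from any library):

```python
def aggregate_tok_per_s(instances: int, per_instance_tok_per_s: float) -> float:
    """Aggregate throughput if every instance sustains the same rate."""
    return instances * per_instance_tok_per_s

# Figures quoted in the post: 16 instances at about 18 tok/s each.
total = aggregate_tok_per_s(16, 18.0)
print(f"{total:.0f} tok/s aggregate")  # 288, which the post rounds to 300
```

The post gives no per-instance rate for the 32-run case, so the sketch does not extrapolate to it.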
🙏 Thanks to the @NVIDIAAI team for highlighting DFlash support on vLLM!
With DFlash speculative decoding, swapping EAGLE-3 for a DFlash checkpoint is a config-only change — no code edits needed.
It runs through the open-source Speculators library, which links the DFlash drafter to the target model's hidden states in the vLLM inference path.
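To illustrate what a config-only swap could look like, here is a minimal sketch in plain Python. It assumes the drafter is selected by a dictionary shaped like vLLM's `speculative_config` (keys `method`, `model`, `num_speculative_tokens`). The checkpoint names, the token count, and the `"dflash"` method string are hypothetical placeholders, not confirmed values, so check the vLLM and Speculators documentation for the real ones.

```python
from copy import deepcopy

# Assumed shape, modeled on vLLM's speculative_config dict.
# The checkpoint name and token count are hypothetical placeholders.
eagle3_config = {
    "method": "eagle3",
    "model": "example-org/target-eagle3-drafter",
    "num_speculative_tokens": 4,
}

def swap_drafter(config: dict, method: str, model: str) -> dict:
    """Return a copy of the config that points at a different drafter."""
    new_config = deepcopy(config)
    new_config["method"] = method
    new_config["model"] = model
    return new_config

# "dflash" as a method string is an assumption made for illustration.
dflash_config = swap_drafter(
    eagle3_config, "dflash", "example-org/target-dflash-drafter"
)
print(dflash_config)
```

Only the configuration changes here; whatever serving code consumes it stays the same, which is the point of the "no code edits" claim.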
On Gemma-4 31B on a single Blackwell Ultra GPU, DFlash delivers up to 5.8x higher throughput than autoregressive decoding at the same concurrency:
🧮 Math500 — 5.8x
➕ GSM8K — 5.3x
💻 HumanEval — 5.6x
🐍 MBPP — 4.4x
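One way to read those multipliers: if a throughput speedup of s carries over directly to wall-clock time for a fixed number of generated tokens (an assumption, since the post only reports throughput), the time saved is 1 - 1/s. A rough sketch using the numbers above, with an illustrative helper name:

```python
# Speedups over autoregressive decoding, as listed in the post.
speedups = {"Math500": 5.8, "GSM8K": 5.3, "HumanEval": 5.6, "MBPP": 4.4}

def time_saved_fraction(speedup: float) -> float:
    """Fraction of decode time saved, assuming time scales as 1/throughput."""
    return 1.0 - 1.0 / speedup

for name, s in speedups.items():
    print(f"{name}: {s}x -> {time_saved_fraction(s):.0%} less decode time")
```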
Read the blog here! 👇