@0xSero @tszzl Y’all, please check out my research! A 30B model running CPU-only at 14 t/s, thanks to a quantum processor informing a Karpathy-style loop!
https://t.co/4aaRzFVa7w
@thsottiaux Codex helped me run a quantum Karpathy loop to tune a 30B model to run at 14 t/s on an 8GB MacBook from 2017. I love Codex 🙏
https://t.co/4aaRzFVa7w
@TheRealAdamG Codex helped me run a quantum Karpathy loop to tune a 30B model to run at 14 t/s on an 8GB MacBook from 2017. I love Codex 🙏
https://t.co/4aaRzFVa7w
15,489% improvement over the baseline while preserving coherent output at 14.03 t/s after using a quantum computer to help fine-tune hyperparameters on a legacy no-GPU device.
I bought an old 2017 MacBook Air at Goodwill because it was not working. It has an Intel processor, 8 GB of RAM, and no GPU. I fixed it and turned it into an AI experiment machine.
Dan Woods @danveloper inspired me by getting a big model to run on a small machine. I thought, let’s see what this pre-Attention Is All You Need, no-GPU Goodwill box can do.
I started off at 0.09 tokens per second with llama.cpp and a Qwen 30B MoE coding model.
I was using Codex on that same machine, and I asked it to look up the @karpathy (Andrej Karpathy) style autoresearch project. Basically, I wanted Codex to run an automated experiment cycle: test settings, measure tokens/sec and output quality, then suggest the next candidate.
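A minimal sketch of that kind of cycle, for anyone who wants the shape of it. This is not the actual harness: the knob names, the value ranges, and the `run_benchmark` stand-in (which fakes a score so the example runs without a model) are all assumptions made for illustration.

```python
import random

# Hypothetical search space; knob names and values are illustrative only.
SEARCH_SPACE = {
    "threads": [2, 4],
    "batch_size": [64, 128, 256, 512],
    "context_size": [4096, 8192, 16384, 32768],
}


def run_benchmark(settings):
    """Stand-in for a real llama.cpp run (assumption, not a real measurement).

    Returns (tokens_per_sec, is_coherent). The formula is made up so the
    loop can be exercised without loading a model.
    """
    tps = settings["threads"] * 0.5 + settings["batch_size"] / 512
    coherent = settings["context_size"] <= 16384
    return tps, coherent


def propose(best, rng):
    """Suggest the next candidate by changing one knob of the best settings."""
    candidate = dict(best)
    knob = rng.choice(sorted(SEARCH_SPACE))
    candidate[knob] = rng.choice(SEARCH_SPACE[knob])
    return candidate


def search(iterations=50, seed=0):
    """Test settings, measure speed and quality, keep the best gated result."""
    rng = random.Random(seed)
    best = {name: values[0] for name, values in SEARCH_SPACE.items()}
    best_tps, _ = run_benchmark(best)
    for _ in range(iterations):
        candidate = propose(best, rng)
        tps, coherent = run_benchmark(candidate)
        # Quality gate: a faster run only counts if the output stays coherent.
        if coherent and tps > best_tps:
            best, best_tps = candidate, tps
    return best, best_tps


if __name__ == "__main__":
    print(search())
```

In a real setup, `run_benchmark` would launch the model with the candidate settings and score the output; everything else in the loop stays the same.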
It was awesome. We went from 0.09 t/s to almost 2 t/s in just a couple of minutes. Then I let it run and came back to see it was almost 4 t/s. After another 12 hours of coaching, we hit a wall at 6.49 t/s.
I was so excited.
Then… it hit me.
Quantum.
I literally did not even know if I could access a quantum processor, or QPU. I looked it up, and bingo: IBM had a free access path that let me get an API key and run a small amount of quantum compute. I got one. It took about five seconds. I love @IBMQuantum!
The model was still running locally on the old MacBook Air through llama.cpp, while the QPU helped with searching the weird hyperparameter space.
I designed an MCP harness to act as the go-between for the QPU and the actual machine. We had all of these knobs: KV cache, page cache, layers, swaps, thread settings, batch settings, and on and on. The QPU has its own functions and hooks, so the harness mapped those local knobs into the QPU workflow and let the two systems work together.
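The thread does not say how the harness translated knobs for the QPU, so the following is only one plausible shape, offered as an assumption: a quantum circuit's measurements come back as bitstrings, so each knob can be given a few bits and a measured bitstring decoded back into a settings dictionary. The knob names and options are hypothetical, and no quantum library is called here.

```python
# Hypothetical knobs; each has a power-of-two number of options so it
# maps cleanly onto a whole number of bits.
KNOBS = [
    ("threads", [1, 2, 3, 4]),            # 2 bits
    ("batch_size", [64, 128, 256, 512]),  # 2 bits
    ("kv_cache_quantized", [False, True]),  # 1 bit
]


def bits_needed(options):
    """Number of bits required to index every option."""
    return (len(options) - 1).bit_length()


def decode(bitstring):
    """Turn a measured bitstring such as '10110' into knob settings."""
    settings = {}
    pos = 0
    for name, options in KNOBS:
        width = bits_needed(options)
        index = int(bitstring[pos:pos + width], 2)
        settings[name] = options[index]
        pos += width
    return settings


def encode(settings):
    """Inverse of decode: turn knob settings back into a bitstring."""
    parts = []
    for name, options in KNOBS:
        width = bits_needed(options)
        parts.append(format(options.index(settings[name]), f"0{width}b"))
    return "".join(parts)


if __name__ == "__main__":
    print(decode("10110"))
```

With an encoding like this, whatever the quantum side samples can be handed straight to the local benchmark loop as a candidate.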
Then we started a new Karpathy-style loop informed by the QPU results.
At first, nothing happened. The QPU-suggested experiments were coming in worse than our 6.49 t/s high-water mark.
But then, after only a few iterations, we were at 7 t/s.
I about fell out of my chair and spilled my coffee.
Then it just went supernova.
It was surreal.
Suddenly, it was 12 t/s. I was like, “We have to call the Pentagon.” Lol. No, but it was mind-blowing. From 0.09 to 12 t/s on the same metal? The quantum-assisted search loop was finding hyperparameter combinations that ChatGPT 5.5 and the prior experiments had not found.
That was some kind of horizon, because over the next 8 hours we kept pushing. The gains were not as drastic after that, but they were still significant.
It eventually got to over 16 t/s, but it lost coherence. The output became garbled. So I treated that as a failed run and backed it off.
The stable quality-gated result was 14.03 t/s with a 16k context window. At that speed, it was still producing coherent and factual outputs in my evaluations, which ranged from short prompts and responses to longer-context prompts and responses.
The final stable result was a jump from 0.09 t/s to 14.03 t/s.
That is about a 156x improvement from the original baseline.
As a percentage increase, that is roughly 15,489%.
On a 2017 Intel MacBook Air from Goodwill.
No GPU. No cloud inference. Same machine. Same basic local setup.
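The speedup figures above can be checked with a few lines of arithmetic. The two inputs are the baseline and final numbers from this thread; the variable names are just for illustration.

```python
baseline_tps = 0.09   # starting speed reported in the thread
final_tps = 14.03     # stable quality-gated result reported in the thread

# "Times faster" is the ratio of the two speeds.
speedup = final_tps / baseline_tps

# Percentage increase is the gain relative to the baseline.
percent_increase = (final_tps - baseline_tps) / baseline_tps * 100

print(f"{speedup:.0f}x, {percent_increase:,.0f}%")
```

Note that the multiplier and the percentage increase differ by 100 percentage points: 156x is about 15,589% of the baseline, which is a 15,489% increase over it.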
Check out the project here:
https://t.co/9yjaSGkX0C
@ggerganov, @qiskit, @simonw, wanted to show y’all, too!! 🙏