If you are sceptical about local AI...
AI unlocking huge performance improvements.
At the same time, breaking down barriers.
All these improvements lead to local inference at exceptional quality!
ByteDance has published a paper that should make every NVIDIA investor sweat.
They trained an AI that writes CUDA better than humans experts.
They call it CUDA Agent.
And it completely rewrites the economics of AI hardware.
They built a massive agentic reinforcement learning loop. The AI writes a kernel, compiles it, profiles the hardware, analyzes the bottlenecks, and rewrites the code until it's flawless.
It learned how to optimize memory access patterns and hardware tiling strategies that traditional compilers miss.
The results are staggering.
On the industry-standard KernelBench, CUDA Agent completely destroyed traditional compilers.
It delivered code that runs up to 3.2x faster than PyTorch's native execution.
On the hardest, most complex models, it beat the strongest proprietary models in the world—including Claude Opus 4.5 and Gemini 3 Pro, by 40%.
It didn't just match human experts. It started discovering optimizations that static compilers literally cannot see.
Here is why this is a massive threat to NVIDIA.
NVIDIA's dominance relies on the fact that CUDA is incredibly hard to master. Developers get locked in because optimizing code for other chips is too painful.
But if an AI agent can autonomously generate hyper-optimized hardware kernels...
You don't need a team of $500k a year CUDA engineers to build world-class infrastructure.
And if an AI can autonomously master CUDA, it can master AMD's ROCm. Or custom silicon.
The impenetrable software wall protecting NVIDIA's monopoly just got breached by a reinforcement learning loop.
If anyone can automatically squeeze maximum performance out of any chip...
Hardware becomes a commodity.
@MarketPalmer_ Automation drives this. Any process that can be automated, will drive down unit cost
Good example: home delivery.
- warehouse logistics is mostly automated
- last mile still relies on humans
What to watch:
- robo-taxi
- all logistics - trucking, shipping next
OpenAI's GPT-OSS-120B runs on a single RTX 5090.
it's a 59GB model in native MXFP4. it doesn't fit in 32GB of VRAM.
the move is MoE offload: keep attention on the GPU, spill the expert weights to system RAM (llama.cpp --n-cpu-moe).
this way, only 5.1B of 117B params fire per token, so the CPU side stays cheap.
with reasoning on, measured on my box, temperature 0, ~100 items per task (MMLU 114):
- MMLU 89.5
- GSM8K 97.0
- HumanEval 98.0 pass@1
- ARC-Challenge 95.0
that's a good frontier-grade scores, on one consumer GPU.
~~~
it is quite slow tho: 47 tok/s generation.
that's because the experts live in RAM, so token speed waits on the CPU, not the 5090.
prefill is fine with 473 tok/s at 512 ctx. it is generation that pays the offload tax.
the model is usable, not fast. but you get a real frontier model you fully own, on hardware you can buy, for the price of patience.
I replaced all the espresso beans in the pantry with decaf 4 weeks ago.
I didn't announce the change.
The placebo effect carried the junior associates through the first 72 hours.
By day 4, the fatigue began to manifest as open weeping in the hallways.
One 2nd-year asked if I was feeling the strange lethargy going around the office.
I looked him dead in the eye while sipping a Red Bull.
I told him my energy comes from a pure passion for corporate governance.
He immediately looked ashamed of his own biology.
Yesterday, I switched it back to the highest caffeine roast available.
Now they are vibrating through the corridors like disturbed hornets.
You can't control a team until you control their central nervous systems.
We spent the last few months overhauling our logistics infrastructure around refurbishment, and today we launched a broad set of refurbished products into the Framework Outlet. We'll be able to turn around customer returns into refurbs faster now too!
By default, Aphrodite runs one model per server instance.
Enable multi-model mode - a single API server can load and serve multiple independent models simultaneously.
Each additional model gets its own engine/worker (with some memory overhead, roughly ~1GB extra per model for the CUDA context).
APHRODITE_SERVER_DEV_MODE=1 \
APHRODITE_ENABLE_DYNAMIC_KV_CACHE=1 \
APHRODITE_ENABLE_MULTI_MODEL=1 \
aphrodite run <first-model> \
--enable-inline-model-loading \
[other flags like --max-model-len, --gpu-memory-utilization, etc.]
You can just do stuff! @Openrouter has a number of "*-free" models.
Yes, sometimes they are rate-limited, but amongst the slightly lesser models you should manage to find one that works for you.
Just start
For all the gpu poor and actually just poor bros out there
Kimi k2.6 has a free endpoint on openrouter that’s 120tps
Does this thing just have really low limits or what?
someone 3D printed a full mini ITX PC case for $18
a file, a printer, and you pick the color customization that no factory offers
perks of having 3D printer
My current goal on X is to get my follower count to 1000+ to be eligible for the monetisation.
Rather than just posting stuff - I still do - I am trying to drip content to see if it makes any difference to my engagement stats.
I started using @buffer to schedule my tweets.
Today I discovered the thread feature (yes, I have a lot to learn...)
- Add Tweet button
I think threads are better than long form - ppl (on X) seem to have lost the patience to consume long form.
Qwen3.6 35B A3B can't fill out a paper form on its own. But give it NVIDIA's LocateAnything-3B — the #1 trending model on HuggingFace — as its eyes, and the two small models get it done together.
(The test: place each element at the right pixel position on a blank form image, not type into a field.)
Setup:
> Qwen is the brain (main model), LocateAnything is the eyes (helper model acting as a tool).
> I gave Qwen a new tool: ask "where's the email field?" and LocateAnything returns the exact x, y, width, height.
> The blue boxes on the screen are its detections. Look how tight they are — it nails every field.
Result:
> Qwen3.6 35B A3B + LocateAnything-3B: form completed, all info correct.
> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code: all landed in the right field areas.
> Character-box alignment still a touch loose, but every value is where it belongs.
> 9m10s, 224.5k input, 24.3k output, 21 turns.
Why it matters:
> Qwen alone can't finish this test. Bolt on a 3B model that does exactly one thing > locate > and suddenly it can.
> A combination of small models can do the work of a single large one.