Derek Colley @DerekColley_ - Twitter Profile

If you are sceptical about local AI... AI unlocking huge performance improvements. At the same time, breaking down barriers. All these improvements lead to local inference at exceptional quality!

How To AI

@HowToAI_

about 19 hours ago

ByteDance has published a paper that should make every NVIDIA investor sweat.

They trained an AI that writes CUDA better than human experts.

They call it CUDA Agent.

And it completely rewrites the economics of AI hardware.

They built a massive agentic reinforcement learning loop. The AI writes a kernel, compiles it, profiles the hardware, analyzes the bottlenecks, and rewrites the code until it's flawless.

It learned how to optimize memory access patterns and hardware tiling strategies that traditional compilers miss.

The results are staggering.

On the industry-standard KernelBench, CUDA Agent completely destroyed traditional compilers.

It delivered code that runs up to 3.2x faster than PyTorch's native execution.

On the hardest, most complex models, it beat the strongest proprietary models in the world, including Claude Opus 4.5 and Gemini 3 Pro, by 40%.

It didn't just match human experts. It started discovering optimizations that static compilers literally cannot see.

Here is why this is a massive threat to NVIDIA.

NVIDIA's dominance relies on the fact that CUDA is incredibly hard to master. Developers get locked in because optimizing code for other chips is too painful.

But if an AI agent can autonomously generate hyper-optimized hardware kernels...

You don't need a team of $500k a year CUDA engineers to build world-class infrastructure.

And if an AI can autonomously master CUDA, it can master AMD's ROCm. Or custom silicon.

The impenetrable software wall protecting NVIDIA's monopoly just got breached by a reinforcement learning loop.

If anyone can automatically squeeze maximum performance out of any chip...

Hardware becomes a commodity.
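The write, check, measure, rewrite cycle described in the post can be illustrated with a toy loop. This is only a sketch of the general "verify, then time, then keep the fastest" idea, not ByteDance's method: the candidates here are plain Python functions rather than CUDA kernels, no reinforcement learning is involved, and every name (`reference`, `candidate_builtin_sum`, `candidate_wrong`, `measure`, `pick_best`) is hypothetical.

```python
import time
from typing import Callable, Iterable, Sequence, Tuple

Kernel = Callable[[Sequence[float]], float]


def reference(xs: Sequence[float]) -> float:
    """Trusted baseline (plays the role of the unoptimized implementation)."""
    total = 0.0
    for x in xs:
        total += x * x
    return total


def candidate_builtin_sum(xs: Sequence[float]) -> float:
    """Hypothetical rewrite: same math, expressed with the built-in sum()."""
    return sum(x * x for x in xs)


def candidate_wrong(xs: Sequence[float]) -> float:
    """Hypothetical bad rewrite: cheaper, but computes the wrong thing."""
    return sum(xs)


def measure(fn: Kernel, xs: Sequence[float], repeats: int = 5) -> float:
    """Best-of-N wall-clock time in seconds, a crude stand-in for profiling."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(xs)
        best = min(best, time.perf_counter() - start)
    return best


def pick_best(
    candidates: Iterable[Kernel], xs: Sequence[float], rel_tol: float = 1e-9
) -> Tuple[Kernel, float]:
    """Keep the fastest candidate that still matches the reference output."""
    expected = reference(xs)
    best_fn, best_time = reference, measure(reference, xs)
    for fn in candidates:
        # Correctness gate first: a fast wrong answer is rejected outright.
        if abs(fn(xs) - expected) > rel_tol * max(1.0, abs(expected)):
            continue
        elapsed = measure(fn, xs)
        if elapsed < best_time:
            best_fn, best_time = fn, elapsed
    return best_fn, best_time


if __name__ == "__main__":
    data = [float(i) for i in range(100_000)]
    winner, seconds = pick_best([candidate_wrong, candidate_builtin_sum], data)
    print(winner.__name__, f"{seconds * 1e3:.3f} ms")
```

Which candidate wins depends on the machine, so no expected output is given. The only guarantee is that `candidate_wrong` can never be selected, because it fails the correctness gate before it is timed.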


DerekColley_ retweeted

Simon Holland

@simoncholland

1 day ago

Daughter is home from college and wants to know if we want to go do something at 9 PM. Uh, we don’t do that here.



Derek Colley

@DerekColley_

about 17 hours ago

@MarketPalmer_ Automation drives this. Any process that can be automated will drive down unit cost.

Good example: home delivery.
- warehouse logistics is mostly automated
- last mile still relies on humans

What to watch:
- robo-taxi
- all logistics
- trucking, shipping next


DerekColley_ retweeted

witcheer

@witcheer

about 19 hours ago

OpenAI's GPT-OSS-120B runs on a single RTX 5090.

it's a 59GB model in native MXFP4. it doesn't fit in 32GB of VRAM.
the move is MoE offload: keep attention on the GPU, spill the expert weights to system RAM (llama.cpp --n-cpu-moe).

this way, only 5.1B of 117B params fire per token, so the CPU side stays cheap.

with reasoning on, measured on my box, temperature 0, ~100 items per task (MMLU 114):

- MMLU 89.5
- GSM8K 97.0
- HumanEval 98.0 pass@1
- ARC-Challenge 95.0

those are good frontier-grade scores, on one consumer GPU.

it is quite slow tho: 47 tok/s generation.

that's because the experts live in RAM, so token speed waits on the CPU, not the 5090.

prefill is fine with 473 tok/s at 512 ctx. it is generation that pays the offload tax.

the model is usable, not fast. but you get a real frontier model you fully own, on hardware you can buy, for the price of patience.
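The numbers in the post allow a rough back-of-the-envelope check of why generation speed depends on memory traffic. The sketch below uses only the figures quoted above and makes three loud assumptions that are not in the post: 1 GB is taken as 1e9 bytes, the average bytes per parameter is simply file size divided by parameter count, and every active parameter is read once per generated token. Attention weights actually sit on the GPU, so the result is an upper bound on traffic from system RAM, not a measurement.

```python
# Figures quoted in the post.
TOTAL_PARAMS = 117e9    # total parameters
ACTIVE_PARAMS = 5.1e9   # parameters that fire per token
MODEL_BYTES = 59e9      # native MXFP4 size, assuming 1 GB = 1e9 bytes
VRAM_BYTES = 32e9       # RTX 5090 VRAM, same unit assumption
GEN_TOK_PER_S = 47.0    # measured generation speed


def offload_estimate() -> dict:
    """Rough arithmetic only; ignores KV cache, activations and caching."""
    active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
    # Average storage cost per parameter (assumption: uniform across tensors).
    bytes_per_param = MODEL_BYTES / TOTAL_PARAMS
    # Weight bytes touched per token if each active parameter is read once.
    bytes_per_token = ACTIVE_PARAMS * bytes_per_param
    # Sustained weight traffic implied by the measured token rate.
    implied_bandwidth = bytes_per_token * GEN_TOK_PER_S
    # How much of the model cannot fit in VRAM even with nothing else loaded.
    overflow_bytes = MODEL_BYTES - VRAM_BYTES
    return {
        "active_fraction": active_fraction,
        "bytes_per_param": bytes_per_param,
        "bytes_per_token": bytes_per_token,
        "implied_bandwidth": implied_bandwidth,
        "overflow_bytes": overflow_bytes,
    }


if __name__ == "__main__":
    est = offload_estimate()
    print(f"active fraction    : {est['active_fraction']:.2%}")
    print(f"bytes per parameter: {est['bytes_per_param']:.3f}")
    print(f"weights per token  : {est['bytes_per_token'] / 1e9:.2f} GB")
    print(f"implied traffic    : {est['implied_bandwidth'] / 1e9:.0f} GB/s")
    print(f"does not fit VRAM  : {est['overflow_bytes'] / 1e9:.0f} GB")
```

Under these assumptions, roughly 4% of the parameters are active per token, which is about 2.6 GB of weights, and 47 tok/s implies on the order of 120 GB/s of weight reads. That is consistent with the post's point that generation waits on the RAM and CPU side, not on the GPU.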


Derek Colley

@DerekColley_

about 18 hours ago

Tell me you're not using MFA for your logins, without using the phrase MFA?


Derek Colley

@DerekColley_

about 18 hours ago

There is a place in hell for you sir! Don't mess with my coffee...

Bill Moore, Esq.

@lawyer_memes

1 day ago

I replaced all the espresso beans in the pantry with decaf 4 weeks ago. I didn't announce the change. The placebo effect carried the junior associates through the first 72 hours. By day 4, the fatigue began to manifest as open weeping in the hallways. One 2nd-year asked if I was feeling the strange lethargy going around the office. I looked him dead in the eye while sipping a Red Bull. I told him my energy comes from a pure passion for corporate governance. He immediately looked ashamed of his own biology. Yesterday, I switched it back to the highest caffeine roast available. Now they are vibrating through the corridors like disturbed hornets. You can't control a team until you control their central nervous systems.


DerekColley_ retweeted

Framework

@FrameworkPuter

1 day ago

We spent the last few months overhauling our logistics infrastructure around refurbishment, and today we launched a broad set of refurbished products into the Framework Outlet. We'll be able to turn around customer returns into refurbs faster now too! https://t.co/oLKezPSew9


Derek Colley

@DerekColley_

about 18 hours ago

By default, Aphrodite runs one model per server instance.

Enable multi-model mode - a single API server can load and serve multiple independent models simultaneously. Each additional model gets its own engine/worker (with some memory overhead, roughly ~1GB extra per model for the CUDA context).

APHRODITE_SERVER_DEV_MODE=1 \
APHRODITE_ENABLE_DYNAMIC_KV_CACHE=1 \
APHRODITE_ENABLE_MULTI_MODEL=1 \
aphrodite run <first-model> \
  --enable-inline-model-loading \
  [other flags like --max-model-len, --gpu-memory-utilization, etc.]

Lotto

@LottoLabs

1 day ago

Why am I just learning about this inference engine? Seems like a fine-tuned vLLM with GGUF support, TQ, spec decode, etc. https://t.co/2tHRSBPMqc



Derek Colley

@DerekColley_

about 19 hours ago

@SlimTradeyBaby Ha, got it straight away. These are DGX-2 seedlings. Nice! In a couple months you can harvest them -😋


Derek Colley

@DerekColley_

about 19 hours ago

You can just do stuff! @Openrouter has a number of "*-free" models. Yes, sometimes they are rate-limited, but amongst the slightly lesser models you should manage to find one that works for you. Just start

Lotto

@LottoLabs

1 day ago

For all the gpu poor and actually just poor bros out there

Kimi k2.6 has a free endpoint on openrouter that’s 120tps

Does this thing just have really low limits or what? https://t.co/qjOogFx8Db


Derek Colley

@DerekColley_

about 20 hours ago

Agent Context Compression is the new GC...


DerekColley_ retweeted

Saint Nomad

@Saint_n0mad

1 day ago

someone 3D printed a full mini ITX PC case for $18

a file, a printer, and you pick the color customization that no factory offers

perks of having 3D printer https://t.co/ISNIqNC73q


Derek Colley

@DerekColley_

about 20 hours ago

My current goal on X is to get my follower count to 1000+ to be eligible for the monetisation.

Rather than just posting stuff - I still do - I am trying to drip content to see if it makes any difference to my engagement stats.

I started using @buffer to schedule my tweets.

Today I discovered the thread feature (yes, I have a lot to learn...)
- Add Tweet button

I think threads are better than long form - ppl (on X) seem to have lost the patience to consume long form.


Derek Colley

@DerekColley_

about 21 hours ago

@mr_r0b0t @perplexity_ai I think Cursor also does this


Derek Colley

@DerekColley_

1 day ago

@80s_channel Ah yes, but can you hum the theme tunes?


DerekColley_ retweeted

stevibe

@stevibe

1 day ago

Qwen3.6 35B A3B can't fill out a paper form on its own. But give it NVIDIA's LocateAnything-3B (the #1 trending model on HuggingFace) as its eyes, and the two small models get it done together.

(The test: place each element at the right pixel position on a blank form image, not type into a field.)

Setup:
> Qwen is the brain (main model), LocateAnything is the eyes (helper model acting as a tool).
> I gave Qwen a new tool: ask "where's the email field?" and LocateAnything returns the exact x, y, width, height.
> The blue boxes on the screen are its detections. Look how tight they are: it nails every field.

Result:
> Qwen3.6 35B A3B + LocateAnything-3B: form completed, all info correct.
> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code: all landed in the right field areas.
> Character-box alignment still a touch loose, but every value is where it belongs.
> 9m10s, 224.5k input, 24.3k output, 21 turns.

Why it matters:
> Qwen alone can't finish this test. Bolt on a 3B model that does exactly one thing, locate, and suddenly it can.
> A combination of small models can do the work of a single large one.
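The "brain plus eyes" split described above comes down to a tool call that maps a field name to a bounding box, followed by placing text inside that box. The sketch below shows that shape only. The post does not show LocateAnything's real interface, so the locator is replaced by a hypothetical lookup table, and all names (`Box`, `FAKE_DETECTIONS`, `locate`, `fill_form`) and coordinates are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Bounding box in pixels: top-left corner plus size."""
    x: int
    y: int
    width: int
    height: int


# Hypothetical canned detections standing in for the locator model's output.
FAKE_DETECTIONS: Dict[str, Box] = {
    "name": Box(x=120, y=80, width=260, height=28),
    "email": Box(x=120, y=340, width=260, height=28),
}


def locate(field: str) -> Optional[Box]:
    """Tool the main model calls: 'where is this field?' -> box or None."""
    return FAKE_DETECTIONS.get(field)


def fill_form(
    values: Dict[str, str], padding: int = 4
) -> Tuple[List[Tuple[str, int, int]], List[str]]:
    """Turn field values into (text, x, y) placements using the locator.

    Text is anchored slightly inside the left edge and at the vertical
    middle of the detected box. Fields the locator cannot find are reported
    back so the caller (the main model) can retry or rephrase.
    """
    placements: List[Tuple[str, int, int]] = []
    missing: List[str] = []
    for field, text in values.items():
        box = locate(field)
        if box is None:
            missing.append(field)
            continue
        placements.append((text, box.x + padding, box.y + box.height // 2))
    return placements, missing


if __name__ == "__main__":
    done, not_found = fill_form({"email": "a@example.com", "fax": "n/a"})
    print(done)       # [('a@example.com', 124, 354)]
    print(not_found)  # ['fax']
```

Keeping the locator behind a single narrow function is what lets a small specialized model be swapped in as a tool: the main model only needs to know the question it can ask and the shape of the answer.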

