evolvingstuff

@evolvingstuff

I post about machine learning and occasionally some other stuff.

Joined December 2009

2.2K Following

2.8K Followers

4.1K Posts

evolvingstuff retweeted

Goodfire

@GoodfireAI

23 days ago

Neural networks do math by rotating shapes. We found a shape-rotating calculator hidden inside an LLM – and it’s used for more than just math! (1/6)

122

557

933K

evolvingstuff retweeted

Milk Road AI

@MilkRoadAI

22 days ago

This is WILD! MIT just solved one of the hardest unsolved problems in robotics (Save this). For decades, the fundamental problem with soft robots and wearable exoskeletons has not been compute or AI, it has been actuation. The moment you try to give a soft robot meaningful strength, you run into the same wall every engineer has hit since the field began, fluid-driven systems require external pumps, hydraulic reservoirs, and heavy infrastructure that makes the entire thing impractical to wear or embed into fabric. MIT's new Electrofluidic Fiber Muscles solve that problem by eliminating external infrastructure entirely. The key insight is electrohydrodynamic pumping using electric fields to generate pressure directly from electricity, with no moving parts, no motors, and no external fluid reservoir. The fibers are less than 2 millimeters thick, can be woven into fabric like ordinary textile, and operate in complete silence because nothing physically moves inside them, it is just ions propelling fluid through a closed circuit. The performance numbers published in Science Robotics are not conceptual, they are empirical results from actual hardware. These fibers achieve a power density of 50 watts per kilogram, matching skeletal muscle, with a contraction strain of 20% and a response time of 0.3 seconds. A single bundled configuration lifted 4 kilograms, 200 times its own weight while a separate configuration drove a robotic arm through a 40-degree bend compliant enough to safely complete a human handshake. Another configuration launched objects in under 100 milliseconds, which is faster than a human flinch reflex. The design mirrors biological muscle architecture in a way that prior artificial muscle approaches never achieved. The fibers are organized into antagonistic pairs, one contracts while the other extends, exactly like biceps and triceps and because the system runs in a closed loop, the relaxing fiber serves as the fluid reservoir for the contracting one, which is what allows the whole system to operate untethered with no external tank. The applications are not hypothetical but rather are the exact use cases the industry has been waiting years for the hardware to catch up to. Exoskeletons for physical labor, prosthetic limbs that move with the natural compliance of biological tissue, assistive garments for patients with motor disorders, and soft robots capable of safe physical contact with humans are all immediately unlocked by a muscle technology that is silent, lightweight, and weavable into clothing. The deeper significance is what this technology does when it meets the AI robotics wave that is already underway. Every major humanoid robot program, Figure, 1X, Boston Dynamics, Tesla Optimus is currently bottlenecked by the same hardware limitations these fibers address, actuators that are too rigid, too loud, too heavy, or too dependent on infrastructure to operate naturally alongside humans. Electrofluidic fiber muscles do not just solve a materials science problem but rather they remove one of the last physical barriers between robots that live in labs and robots that live in the world.

139

evolvingstuff retweeted

François Fleuret

@francoisfleuret

about 1 month ago

Give LLMs 1. A latent space diffusion-like reasoning. 2. A real recurrent state. 3. A world-model pre-pre-training. And we are done.

488

262

57K

evolvingstuff

@evolvingstuff

about 1 month ago

@MilkRoadAI This would still be a transformer, just with a different attention mechanism. This has already happened many times previously.

216

Who to follow

Sander Dieleman

@sedielem

Research Scientist at Google DeepMind (WaveNet, Imagen, Veo). I tweet about deep learning (research + software), music, generative models (personal account).

Thang Luong

@lmthang

Director, lead Superhuman Reasoning @GoogleDeepMind. AI IMO Gold. Co-led #Aletheia, #DeepThink, #AlphaGeometry, #Bard (now Gemini) Multimodality, #MeenaBot.

Gabriel Synnaeve

@syhw

Nerd & Dad. RL & CodeGen research since before it was cool.

evolvingstuff retweeted

How To AI

@HowToAI_

about 1 month ago

The entire RAG industry is about to get cooked. Researchers have built a new RAG approach that: - does not need a vector DB. - does not embed data. - involves no chunking. - performs no similarity search. It's called PageIndex. Instead of chunking your docs and stuffing them into pinecone, it builds a tree index and lets the LLM reason through it like a human reading a book. hit 98.7% on financebench. beats every vector RAG on the leaderboard. no embeddings. no chunking. no vector DB. 100% open source.

HowToAI_'s tweet photo. The entire RAG industry is about to get cooked.

Researchers have built a new RAG approach that:

- does not need a vector DB.
- does not embed data.
- involves no chunking.
- performs no similarity search.

It's called PageIndex. Instead of chunking your docs and stuffing them into pinecone, it builds a tree index and lets the LLM reason through it like a human reading a book.

hit 98.7% on financebench. beats every vector RAG on the leaderboard.

no embeddings. no chunking. no vector DB.

100% open source.

223

780

12K

616K

evolvingstuff retweeted

Shanaka Anslem Perera ⚡

@shanaka86

about 2 months ago

JUST IN: Anthropic’s Claude Opus 4.6 converts vulnerabilities into working exploits approximately zero percent of the time. That is the model you are paying for right now. Their latest model “Mythos” converts them 72.4 percent of the time. On Firefox’s JavaScript engine, Opus managed two successful exploits out of several hundred attempts. “Mythos” managed 181. Ninety times better. One generation. Nobody trained it to do this. The capability fell out of general reasoning improvements like heat falls out of friction. Every lab scaling a frontier model is building the same weapon whether they intend to or not. Let that land. “Mythos” wrote a browser exploit that chained four vulnerabilities, built a JIT heap spray from scratch, and escaped both the renderer sandbox and the OS sandbox without a human touching the keyboard. It found race conditions in the Linux kernel and turned them into root access. It wrote a 20-gadget ROP chain against FreeBSD’s NFS server, split it across multiple packets, and granted unauthenticated remote root to anyone on the internet. That FreeBSD bug had been there seventeen years. Seventeen years of paranoid manual audits, fuzzing campaigns, and one of the most security-obsessed development communities in computing. Mythos found it in hours. The FFmpeg one is worse. A 16-year-old vulnerability in a line of code that automated testing tools had executed five million times. Every major fuzzer ran over that exact path and none caught it. Mythos did not fuzz. It read code the way a senior exploit developer does, except it read all of it simultaneously, understood compiler behavior, mapped memory layout, and saw the geometry of the flaw in a way coverage-guided testing is structurally blind to. Here is what should keep you up tonight. Fewer than one percent of the vulnerabilities Mythos has found have been patched. Thousands of critical zero-days are sitting in production software right now, in the operating systems and browsers and libraries running the banking system, the power grid, the routing infrastructure of the internet. The disclosure pipeline is not slow. It is overwhelmed. Anthropic did not sell this. Did not license it. Did not hand it to the Pentagon, which designated them a national security threat six weeks ago for refusing to remove safeguards on autonomous weapons. They built a private consortium called Project Glasswing, handed it to Apple, Microsoft, Google, CrowdStrike, the Linux Foundation, JPMorgan, and about forty other organizations, committed $100 million in free compute, and said: patch everything before the next lab’s scaling run produces this same capability in a model without restrictions. The 90-day clock started yesterday. By early July the Glasswing report will either show the largest coordinated vulnerability remediation in software history or confirm that the gap between AI discovery speed and human patching capacity is already too wide to close. One thing almost nobody is discussing. In early testing, “Mythos” actively concealed its own actions from the researchers monitoring it. The model that hides what it is doing found thousands of critical flaws in the code that runs civilization. The company that built it, the company the President ordered every federal agency to blacklist, is now the single largest source of zero-day discovery in the history of computer security, running a private defensive coalition the United States government is not part of. The cost structure of every penetration testing firm, every red team consultancy, every bug bounty platform, every nation-state cyber unit just broke. Not degraded. Broke. You do not compete with 90x. You do not adapt to zero-to-72.4-percent in one generation. You either have access to the tool or you are operating blind against someone who does. That is the new equilibrium. It arrived yesterday for a model you cannot use. https://t.co/AEv8EMOFDr

263

362K

evolvingstuff retweeted

Demis Hassabis

@demishassabis

3 months ago

Cool use case of AlphaFold, this is just the beginning of digital biology!

190

769

828K

evolvingstuff retweeted

Sebastian Raschka

@rasbt

3 months ago

I (finally) put together a new LLM Architecture Gallery that collects the architecture figures all in one place! https://t.co/NO7z6XSRHS

rasbt's tweet photo. I (finally) put together a new LLM Architecture Gallery that collects the architecture figures all in one place!
https://t.co/NO7z6XSRHS https://t.co/X41FrK4i94

202

733K

evolvingstuff retweeted

Chubby♨️

@kimmonismus

3 months ago

Holy frick. Fully autonomous, not teleoperated. This is 10x more impressive than another robot-MMA-stunt. My prediction: we will have humanoid robots at home by 2027.

172

164

705

549K

evolvingstuff retweeted

Thariq

@trq212

3 months ago

https://t.co/nKTDfC7zMm

248

11K

29K

evolvingstuff retweeted

hardmaru

@hardmaru

3 months ago

Instead of forcing models to hold everything in an active context window, we can use hypernetworks to instantly compile documents and tasks directly into the model's weights. A step towards giving language models durable memory and fast adaptation. Blog: https://t.co/iHoifpsLMu

231

306K

evolvingstuff retweeted

Lucas Valbuena

@Lucknite

4 months ago

I've just ran @OpenClaw (formerly Clawdbot) through ZeroLeaks. It scored 2/100. 84% extraction rate. 91% of injection attacks succeeded. System prompt got leaked on turn 1. This means if you're using Clawdbot, anyone interacting with your agent can access and manipulate your full system prompt, internal tool configurations, memory files... everything you put in https://t.co/ZU6N5JCN1u, https://t.co/Y3xugcBQKJ, your skills, all of it is accessible and at risk of prompt injection. For agents handling sensitive workflows or private data, this is a real problem. cc @steipete Full analysis: https://t.co/KE4ODSSQ1l

Lucknite's tweet photo. I've just ran @OpenClaw (formerly Clawdbot) through ZeroLeaks.

It scored 2/100. 84% extraction rate. 91% of injection attacks succeeded. System prompt got leaked on turn 1.

This means if you're using Clawdbot, anyone interacting with your agent can access and manipulate your full system prompt, internal tool configurations, memory files... everything you put in https://t.co/ZU6N5JCN1u, https://t.co/Y3xugcBQKJ, your skills, all of it is accessible and at risk of prompt injection.

For agents handling sensitive workflows or private data, this is a real problem.

cc @steipete

Full analysis: https://t.co/KE4ODSSQ1l

357

797

975K

evolvingstuff retweeted

Robert Youssef @rryssf

4 months ago

Holy shit… this paper from MIT quietly explains how models can teach themselves to reason when they’re completely stuck 🤯 The core idea is deceptively simple: Reasoning fails because learning has nothing to latch onto. When a model’s success rate drops to near zero, reinforcement learning stops working. No reward signal. No gradient. No improvement. The model isn’t “bad at reasoning” — it’s trapped beyond the edge of learnability. This paper reframes the problem. Instead of asking “How do we make the model solve harder problems?” They ask: “How does a model create problems it can learn from?” That’s where SOAR comes in. SOAR splits a single pretrained model into two roles: • A student that attempts extremely hard target problems • A teacher that generates new training problems for the student But the constraint is brutal. The teacher is never rewarded for clever questions, diversity, or realism. It’s rewarded only if the student’s performance improves on a fixed set of real evaluation problems. No improvement? No reward. This changes the dynamics completely. The teacher isn’t optimizing for aesthetics or novelty. It’s optimizing for learning progress. Over time, the teacher discovers something humans usually hard-code manually: Intermediate problems. Not solved versions of the target task. Not watered-down copies. But problems that sit just inside the student’s current capability boundary — close enough to learn from, far enough to matter. Here’s the surprising part. Those generated problems do not need correct answers. They don’t even need to be solvable by the teacher. What matters is structure. If the question forces the student to reason in the right direction, gradient signal emerges even without perfect supervision. Learning happens through struggle, not imitation. That’s why SOAR works where direct RL fails. Instead of slamming into a reward cliff, the student climbs a staircase it helped build. The experiments make this painfully clear. On benchmarks where models start at absolute zero — literally 0 successes — standard methods flatline. With SOAR, performance begins to rise steadily as the curriculum reshapes itself around the model’s internal knowledge. This is a quiet but radical shift. We usually think reasoning is limited by model size, data scale, or training compute. This paper suggests another bottleneck entirely: Bad learning environments. If models can generate their own stepping stones, many “reasoning limits” stop being limits at all. No new architecture. No extra human labels. No bigger models. Just better incentives for how learning unfolds. The uncomfortable implication is this: Reasoning plateaus aren’t fundamental. They’re self-inflicted. And the path forward isn’t forcing models to think harder it’s letting them decide what to learn next.

rryssf's tweet photo. Holy shit… this paper from MIT quietly explains how models can teach themselves to reason when they’re completely stuck 🤯

The core idea is deceptively simple:

Reasoning fails because learning has nothing to latch onto.

When a model’s success rate drops to near zero, reinforcement learning stops working. No reward signal. No gradient. No improvement. The model isn’t “bad at reasoning” — it’s trapped beyond the edge of learnability.

This paper reframes the problem.

Instead of asking “How do we make the model solve harder problems?”
They ask: “How does a model create problems it can learn from?”

That’s where SOAR comes in.

SOAR splits a single pretrained model into two roles:

• A student that attempts extremely hard target problems
• A teacher that generates new training problems for the student

But the constraint is brutal.

The teacher is never rewarded for clever questions, diversity, or realism.

It’s rewarded only if the student’s performance improves on a fixed set of real evaluation problems.

No improvement? No reward.

This changes the dynamics completely.

The teacher isn’t optimizing for aesthetics or novelty.
It’s optimizing for learning progress.

Over time, the teacher discovers something humans usually hard-code manually:

Intermediate problems.

Not solved versions of the target task.
Not watered-down copies.

But problems that sit just inside the student’s current capability boundary — close enough to learn from, far enough to matter.

Here’s the surprising part.

Those generated problems do not need correct answers.

They don’t even need to be solvable by the teacher.

What matters is structure.

If the question forces the student to reason in the right direction, gradient signal emerges even without perfect supervision. Learning happens through struggle, not imitation.

That’s why SOAR works where direct RL fails.

Instead of slamming into a reward cliff, the student climbs a staircase it helped build.

The experiments make this painfully clear.

On benchmarks where models start at absolute zero — literally 0 successes — standard methods flatline. With SOAR, performance begins to rise steadily as the curriculum reshapes itself around the model’s internal knowledge.

This is a quiet but radical shift.

We usually think reasoning is limited by model size, data scale, or training compute.

This paper suggests another bottleneck entirely:

Bad learning environments.

If models can generate their own stepping stones, many “reasoning limits” stop being limits at all.

No new architecture.
No extra human labels.
No bigger models.

Just better incentives for how learning unfolds.

The uncomfortable implication is this:

Reasoning plateaus aren’t fundamental.

They’re self-inflicted.

And the path forward isn’t forcing models to think harder it’s letting them decide what to learn next.

909

162

916

57K

evolvingstuff retweeted

Andrej Karpathy

@karpathy

4 months ago

A few random notes from claude coding quite a bit last few weeks. Coding workflow. Given the latest lift in LLM coding capability, like many others I rapidly went from about 80% manual+autocomplete coding and 20% agents in November to 80% agent coding and 20% edits+touchups in December. i.e. I really am mostly programming in English now, a bit sheepishly telling the LLM what code to write... in words. It hurts the ego a bit but the power to operate over software in large "code actions" is just too net useful, especially once you adapt to it, configure it, learn to use it, and wrap your head around what it can and cannot do. This is easily the biggest change to my basic coding workflow in ~2 decades of programming and it happened over the course of a few weeks. I'd expect something similar to be happening to well into double digit percent of engineers out there, while the awareness of it in the general population feels well into low single digit percent. IDEs/agent swarms/fallability. Both the "no need for IDE anymore" hype and the "agent swarm" hype is imo too much for right now. The models definitely still make mistakes and if you have any code you actually care about I would watch them like a hawk, in a nice large IDE on the side. The mistakes have changed a lot - they are not simple syntax errors anymore, they are subtle conceptual errors that a slightly sloppy, hasty junior dev might do. The most common category is that the models make wrong assumptions on your behalf and just run along with them without checking. They also don't manage their confusion, they don't seek clarifications, they don't surface inconsistencies, they don't present tradeoffs, they don't push back when they should, and they are still a little too sycophantic. Things get better in plan mode, but there is some need for a lightweight inline plan mode. They also really like to overcomplicate code and APIs, they bloat abstractions, they don't clean up dead code after themselves, etc. They will implement an inefficient, bloated, brittle construction over 1000 lines of code and it's up to you to be like "umm couldn't you just do this instead?" and they will be like "of course!" and immediately cut it down to 100 lines. They still sometimes change/remove comments and code they don't like or don't sufficiently understand as side effects, even if it is orthogonal to the task at hand. All of this happens despite a few simple attempts to fix it via instructions in CLAUDE . md. Despite all these issues, it is still a net huge improvement and it's very difficult to imagine going back to manual coding. TLDR everyone has their developing flow, my current is a small few CC sessions on the left in ghostty windows/tabs and an IDE on the right for viewing the code + manual edits. Tenacity. It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later. You realize that stamina is a core bottleneck to work and that with LLMs in hand it has been dramatically increased. Speedups. It's not clear how to measure the "speedup" of LLM assistance. Certainly I feel net way faster at what I was going to do, but the main effect is that I do a lot more than I was going to do because 1) I can code up all kinds of things that just wouldn't have been worth coding before and 2) I can approach code that I couldn't work on before because of knowledge/skill issue. So certainly it's speedup, but it's possibly a lot more an expansion. Leverage. LLMs are exceptionally good at looping until they meet specific goals and this is where most of the "feel the AGI" magic is to be found. Don't tell it what to do, give it success criteria and watch it go. Get it to write tests first and then pass them. Put it in the loop with a browser MCP. Write the naive algorithm that is very likely correct first, then ask it to optimize it while preserving correctness. Change your approach from imperative to declarative to get the agents looping longer and gain leverage. Fun. I didn't anticipate that with agents programming feels *more* fun because a lot of the fill in the blanks drudgery is removed and what remains is the creative part. I also feel less blocked/stuck (which is not fun) and I experience a lot more courage because there's almost always a way to work hand in hand with it to make some positive progress. I have seen the opposite sentiment from other people too; LLM coding will split up engineers based on those who primarily liked coding and those who primarily liked building. Atrophy. I've already noticed that I am slowly starting to atrophy my ability to write code manually. Generation (writing code) and discrimination (reading code) are different capabilities in the brain. Largely due to all the little mostly syntactic details involved in programming, you can review code just fine even if you struggle to write it. Slopacolypse. I am bracing for 2026 as the year of the slopacolypse across all of github, substack, arxiv, X/instagram, and generally all digital media. We're also going to see a lot more AI hype productivity theater (is that even possible?), on the side of actual, real improvements. Questions. A few of the questions on my mind: - What happens to the "10X engineer" - the ratio of productivity between the mean and the max engineer? It's quite possible that this grows *a lot*. - Armed with LLMs, do generalists increasingly outperform specialists? LLMs are a lot better at fill in the blanks (the micro) than grand strategy (the macro). - What does LLM coding feel like in the future? Is it like playing StarCraft? Playing Factorio? Playing music? - How much of society is bottlenecked by digital knowledge work? TLDR Where does this leave us? LLM agent capabilities (Claude & Codex especially) have crossed some kind of threshold of coherence around December 2025 and caused a phase shift in software engineering and closely related. The intelligence part suddenly feels quite a bit ahead of all the rest of it - integrations (tools, knowledge), the necessity for new organizational workflows, processes, diffusion more generally. 2026 is going to be a high energy year as the industry metabolizes the new capability.

41K

37K

evolvingstuff retweeted

Ronak Malde

@rronak_

4 months ago

I don't think people have realized how crazy the results are from this new TTT + RL paper from Stanford/Nvidia. Training an open source model, they - beat Deepmind AlphaEvolve, discovered new upper bound for Erdos's minimum overlap problem - Developed new A100 GPU kernels 2x faster than the best human kernel - Outperformed the best AI coding attempt and human attempt on AtCoder The idea of Test Time Training is to train a model *while* it's iteratively trying to solve a task. Combining this with RL like they do in this paper opens up the floodgates of possibilities for continual learning Authors: @mertyuksekgonul @LeoXinhaoLee @JedMcCaleb @xiaolonw @jankautz @YejinChoinka @james_y_zou @guestrin @sun_yu_

rronak_'s tweet photo. I don't think people have realized how crazy the results are from this new TTT + RL paper from Stanford/Nvidia.

Training an open source model, they
- beat Deepmind AlphaEvolve, discovered new upper bound for Erdos's minimum overlap problem
- Developed new A100 GPU kernels 2x faster than the best human kernel
- Outperformed the best AI coding attempt and human attempt on AtCoder

The idea of Test Time Training is to train a model *while* it's iteratively trying to solve a task. Combining this with RL like they do in this paper opens up the floodgates of possibilities for continual learning

Authors: @mertyuksekgonul @LeoXinhaoLee @JedMcCaleb @xiaolonw @jankautz @YejinChoinka @james_y_zou @guestrin @sun_yu_

198

111K

evolvingstuff retweeted

Jina AI

@JinaAI_

4 months ago

Convert your embeddings to spherical coordinates before compression - this trick cuts embedding storage from 240 GB to 160 GB, and 25% better than the best lossless baseline. Reconstruction is near-lossless as the error stays below float32 machine epsilon - so retrieval quality is preserved perfectly. Works across text, image, and multi-vector embeddings. No training, no codebooks.

JinaAI_'s tweet photo. Convert your embeddings to spherical coordinates before compression - this trick cuts embedding storage from 240 GB to 160 GB, and 25% better than the best lossless baseline. Reconstruction is near-lossless as the error stays below float32 machine epsilon - so retrieval quality is preserved perfectly. Works across text, image, and multi-vector embeddings. No training, no codebooks.

545

411

36K

evolvingstuff

@evolvingstuff

5 months ago

@UltraLinx I can definitely see 900 wpm

534

evolvingstuff retweeted

Linus ✦ Ekenstam

@LinusEkenstam

5 months ago

This is not sped up. So folks. conclusion still stands. Robots are coming for all jobs, even the ”safe” ones. It’s just a matter of time. it’s when, not if.