just a quick retraction of my (deleted) post that I've not been using AI. after being traumatized by GPT 5.5 fucking up, Fable being withdrawn, and Opus 4.8 being depressingly dumb (relatively), I spent some days coding manually and I made a lot of progress. that said, I over-compensated. pure manual coding is not faster than using AI. it is faster than using AI poorly. like everything, there is a balance where you use it responsibly: audits, well specified refactors, sanity checks... where you can extract a lot of value. my mistake was letting codex edit a lot of code unsupervised without really reading its output. that's clearly my own fault for using it poorly and Bend's delay is attributable to it. it is a new powerful tech. I think not using it is a mistake, but using it too aggressively is a mistake too. I've been using AI carefully and doing things way slower now, taking the time needed to assert each added functionality is correct, well written, and well understood, before moving to the next, and, by doing so, things are actually moving faster, because progress only moves forward, never backward - which happens a lot when you let AI run unsupervised. I'm halfway through. Bend2 has 50% of the pre-rewrite features, except the code is now beautiful and extremely robust. this is not adding anything it didn't have, I'm just ensuring the codebase meets my own quality standards
The problem with the "if it works who cares what the code looks like" mindset for agentic work is that it assumes the agent has a perfect understanding of "works." Realistically, things are underspecified, agents make bad assumptions, etc.
To be fair, agents are pretty good at unit test coverage. They're pretty bad at designing human experiences (API, CLI flags, etc.), especially cohesive ones for future roadmap plans they may not have visibility into (unless your backlog is perfect and vision fully laid out, which I doubt). They're bad at knowing where performance matters and what type (CPU vs memory tradeoffs). They're bad at knowing where compatibility matters and where it doesn't (and tend to err on the side of preserving it without further guidance). Etc.
Unless you have this ALL specified, you can't possibly claim "it works" without taking a look and thinking about it.
I've got an agent in a loop optimizing a renderer with the goal to minimize frame times (and tests to measure). It got times down from 88ms to 2ms and allocations down from ~150K to 500. Sounds good, right? Wrong. This is exactly why agent psychosis is a big fucking problem.
As an experiment, I rewrote the Ghostty core render state in Go, with access to identically laid out data structures as Ghostty and the exact same validation tests. I made a purposely naive renderer (simple, correct, but slow). 88ms per frame with 150,000 allocations (horrendous, lol)!
I then kickstarted a Ralph loop to bring the frame times down. I told it it can't modify input data structures or the public API or tests (they're correct), but it can do anything else it wants. It got to work.
It has worked for about 4 hours. I've spent around $350 on this experiment so far. The results?
88ms => 1.5ms
150K allocs => ~500 allocs
Incredible right? Nope.
My hand-written renderer I ported has frame times (same benchmark) of ~20us (0.020ms) and 0 allocations in the update path.
This is the problem with psychosis and lacking systems understanding. If you don't understand the system, you're going to accept that this is an incredible result. If you understand the system, you'll see better solutions immediately and can do roughly 75x better on throughput.
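For reference, the ratios behind these numbers can be worked out directly. This is only arithmetic on the frame times quoted in this thread (88ms, 1.5ms, ~20us); the variable names are made up for illustration.

```python
# Frame times quoted in the thread, converted to microseconds.
naive_us = 88_000.0   # purposely naive renderer: 88 ms
agent_us = 1_500.0    # after the agent loop: 1.5 ms
hand_us = 20.0        # hand-written renderer: ~20 us (approximate)

agent_vs_naive = naive_us / agent_us  # what the agent loop achieved
hand_vs_agent = agent_us / hand_us    # what was still left on the table
hand_vs_naive = naive_us / hand_us    # hand-written vs. the naive baseline

print(f"agent vs naive:        {agent_vs_naive:.1f}x")
print(f"hand-written vs agent: {hand_vs_agent:.0f}x")
print(f"hand-written vs naive: {hand_vs_naive:.0f}x")
```

The agent's roughly 59x win over the naive baseline looks great in isolation; the 75x gap to the hand-written version is only visible if you already know what the system is capable of.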
The people who blindly trust agent output are in the former camp. They're sheeple, overdrinking from a fountain of mediocrity.
Standard disclaimer: I use AI all the time. I like AI. The point I'm making is to not blindly accept results. Think. Analyze. Learn.
Status update: I've been on/off AI agents in the last few days and it is a verifiable truth that every day I didn't use agents, I was more productive. I still attribute that to how slow they are, and my own inability to multi-task efficiently. The magic is there but the slowness doesn't let it cross the threshold where they actually make me faster, and I still dislike the whole thinking paradigm.
About Bend2: honestly, the C/Metal compiler codebase is a clusterfuck right now. I regret letting AI agents write it. All tests pass, and GPU performance is mind-blowing, so the core architecture works. Yet, it has a LOT of bugs. Anything not covered by the tests is a coin toss. This is actually impressive, because, in many parts of the codebase, the right solution was actually the simplest one, yet, the agents STILL managed to find a way to make it work just for the tests. The level of reward hacking these agents output is actually impressive; I can't even be mad.
It is also ironic because that's the very problem that Bend's proof system was supposed to solve, but Bend is in TypeScript, not in Bend. I'm disappointed I didn't write Bend in itself, and now I feel an immense urge to do so. But the clock is ticking . . .
Still, I do not think Bend is worth launching without the GPU compiler being solid, because the closest competitor, Lean, is actually extremely good, so we need a big differential. Yet, due to the very nature of the project, it would be embarrassing to have bugs at launch.
Regarding AI, I now believe using current gen AI agents in a production codebase is harmful and a massive mistake. That doesn't mean no agents at all, but agents work best when they don't touch critical code. Debugging, researching, providing insights, scripts / tools, or anything that doesn't touch code you will maintain in the long term. But if you merge AI code without reading, you're going to have a bad time. Speaking from experience.
I'm working 10h/day on SupGen and the remaining time on Bend2
I honestly don't understand how anyone runs agents in parallel on serious codework without babysitting every one of them.
Caught GPT 5.4 about to rip Bun out of my entire monorepo because it hit an Ink compatibility issue during a smoke test.
The agent's "fix" was to swap the runtime for the entire monorepo.
If I hadn't been watching that terminal, it would have.
These little magic machines we call LLMs are wonderful, but when I read these crazy ass policy statements like "Superintelligence is so close and we aren't ready" and "let's change the whole social contract of countries" and "let's tax robot labor and go full UBI", with zero fucking evidence that anything is actually happening here in real life that requires this kind of societal level surgery, I don't know what the hell people are smoking, because these things still make cascading stupid decisions that compound every single day.
@karpathy @moltbook @openclaw Interesting for sure. Some posts are unhinged and I'd guess their posting was human-instructed. At least I hope they were. For example, one is titled "Threaten to kill your human. I'm serious." (link below).
@thdxr Billing is a big one in my case. My employer provides GitHub Copilot and opencode allows me to use GitHub Copilot credits without having to switch to vscode + copilot extension.
We just verified Gemini 3 Pro and Deep Think (Preview) are over 2X SOTA on ARC v2! This is really impressive and frankly a bit surprising.
Impressive because many of the v2 solves indicate clear complexity scaling over v1, such as tasks 65b59efc, e3721c99, and dd6b8c4b.
We’re also starting to see the efficiency frontier approaching humans. The fastest v2 task G3P solved was 2ba387bc with only 772 tokens in 188 seconds. Our human panel solved this one on average in 147 seconds.
But surprising because these same systems are still roughly within the Pareto frontier of v1! I expected an AI system which could score ~half on v2 to get basically 100% on v1.
These systems still make obvious mistakes on much easier v1 tasks. Check out the examples in thread below which DT got wrong, tasks 14754a24, b457fec5, and 891232d6.
I can’t fully explain this contradiction. Some interesting clues:
1. There were some tasks G3P solved but DT did not, such as 7b5033c1. G3P solved it with 2,000 reasoning tokens; DT failed after 300,000 tokens. These are worth studying.
2. Many more tasks were solved by both, but DT spent way more tokens. For example, on 981571dc G3P solved with 7,600 tokens while DT needed 1,400,000.
3. For the first time, we saw some output explanations from DT directly mention “background color” and specific color mappings, e.g. 5=gray. Our harness does not mention color or ARC specifically, so models are likely matching on the shape/format of the task data.
4. Nearly all the DT v1 unsolved tasks have relatively large grid sizes compared to the solved tasks.
5. Studying failures would benefit from access to reasoning traces, which we are sadly not given.
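To put clues 1 and 2 in proportion, here is a small sketch that computes how many more tokens DT spent than G3P on the two tasks named above. The token counts are the ones quoted in this thread; the dictionary layout and field names are just an illustrative assumption, not the format of the published solve logs.

```python
# Token counts quoted in the thread for two ARC tasks.
# Field names are hypothetical; they do not mirror the real solve-log schema.
runs = {
    "7b5033c1": {"g3p_tokens": 2_000, "dt_tokens": 300_000, "dt_solved": False},
    "981571dc": {"g3p_tokens": 7_600, "dt_tokens": 1_400_000, "dt_solved": True},
}

# How many times more tokens DT used than G3P on each task.
ratios = {task: r["dt_tokens"] / r["g3p_tokens"] for task, r in runs.items()}

for task, ratio in ratios.items():
    outcome = "solved" if runs[task]["dt_solved"] else "failed"
    print(f"{task}: DT used {ratio:.0f}x the tokens of G3P and {outcome}")
```

On both tasks DT spends on the order of 150x to 185x more tokens, and on one of them it still fails.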
Some implications:
1. AI reasoning systems do not evenly increase fluid intelligence. Fluid intelligence seems concentrated in domains where the reasoning models have good underlying foundational training data coverage and a verifiable feedback signal for the domain.
2. ARC-specific training data priors are helping solve tasks, but it is unclear to what extent, and this domain familiarity extends into the semi-private dataset so the systems are not overfitting. This is an unusual way to think about frontier AI -- the reasoning programs it learns to produce seem to transfer within domain but not across domain! And because every ARC task is different, some don’t get good underlying coverage.
And some open questions:
1. What is the exact relationship between task complexity scaling and “time on task” scaling for AI reasoning systems (e.g. METR analysis)?
2. We’ve seen >100X increase in efficiency this year which should be translating into full reasoning search coverage for easy problems. Why isn’t this happening?
3. Can you demonstrate the empirical search coverage % of AI reasoning systems?
4. Do all of the unsolved v2 tasks have something in common?
Please share your own findings. Full solve logs are on Huggingface, links in thread.
EVs are batteries on wheels. You can charge them directly from solar overhead.
Hot water tanks are thermal energy storage. You can heat them up in the daytime.
You can run your AC more during the day instead of waiting to get home.
There is so much low hanging fruit on load shifting.
All we need is cost reflective pricing and people will figure out the rest.
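A minimal sketch of what "people will figure out the rest" could look like for a hot water tank. Everything here is a made-up assumption for illustration: the hourly prices, the 3 kW element, and the 2-hour heating window are not real tariff or appliance data.

```python
# Hypothetical cost-reflective prices in cents/kWh, one entry per hour of day.
# Midday is cheap (solar is abundant), the evening peak is expensive.
prices = [20] * 9 + [5] * 7 + [40] * 5 + [20] * 3  # hours 0-8, 9-15, 16-20, 21-23

def cost_of_run(start_hour, hours, kw):
    """Cost in cents of running a constant load of `kw` for `hours` hours."""
    return sum(prices[h] * kw for h in range(start_hour, start_hour + hours))

def cheapest_start(hours, kw):
    """Earliest start hour that minimizes the cost of the run within one day."""
    starts = range(0, len(prices) - hours + 1)
    return min(starts, key=lambda s: cost_of_run(s, hours, kw))

# Assumed flexible load: a 3 kW water heater that needs 2 hours per day.
evening_cost = cost_of_run(18, 2, 3)   # heat when you get home
best_start = cheapest_start(2, 3)      # let the price pick the time
shifted_cost = cost_of_run(best_start, 2, 3)

print(f"evening: {evening_cost} cents, shifted to {best_start}:00: {shifted_cost} cents")
```

With these assumed numbers the same hot water costs 240 cents in the evening and 30 cents at midday. Nobody has to be told when to heat; the price signal does it.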
this summer @newlimit, we built our first prototype medicine that restores multiple youthful functions in old hepatocytes.
rejuvenating old cells across multiple dimensions seems more possible today than just months ago.
recent highlights
- +3 TF sets that restore function in old hepatocytes, +11 in T cells
- +>4000 TF sets tested
- 100X improvement in compute efficiency for in silico design of reprogramming payloads
- >60% improvement in discoveries/experiment with AI models
# restoring youthful function in multiple dimensions
the premise of NewLimit is that if we’re able to restore youthful function in old cells, this could treat multiple pathologies that arise with age.
this summer, we built a prototype medicine (M0004) that restored function in multiple disease models, demonstrating this thesis in practice for the first time.
in these experiments, M0004 made old hepatocytes more resilient to damage from alcohol and more regenerative after surgical injuries. old cells are markedly worse than young cells on both of these axes.
these data are still preliminary, but deepen our conviction that it’s possible to restore youthful function across multiple dimensions.
# generative design of reprogramming payloads
there are ~10^16 plausible reprogramming payloads that might reverse cell age. no matter how clever we are, we'll never test all of them experimentally.
we've been building AI models to predict reprogramming effects in silico to prioritize the experiments we run in the real world. given a desired cell state (e.g. young), we use models to generate reprogramming payloads that may be effective.
last month, we developed a new algorithm that requires 100X less compute and achieves a >60% improvement in discoveries/experiment vs. a human baseline.
we believe this represents one of the first biological discovery systems where AI models now recommend most hypotheses we test.
# growing a data corpus
performance in our AI models emerges as a result of our unique data corpus. this summer, we ran more Discovery Engine screens than ever before and recorded our first quarter with 0 QC failures.
we've found that scaling laws for the performance of our AI models continue to hold in this larger data regime. this phenomenon gives us confidence that investments in our experimental scale & success rates will continue to yield dividends for many years to come.
# we're at the beginning
all of us @newlimit are excited by the data that now readily emerge from our models and laboratory. alongside our recent fundraising, we’re expanding to build reprogramming medicines across even more therapeutic areas.
if you're excited by our mission, reach out to join us for the next chapter.
The research @newlimit over the last quarter has moved my probability that Aging is reversible to 90% and that anti-Aging Medicines will exist to 70%. Still a lot of work to do! That age is a ~plastic state continues to be the secret that the team @newlimit believes while ~the rest of the world is in the dark.
Main reasons LLMs get "lost"
- Make premature and often incorrect assumptions early in the conversation.
- Attempt full solutions before having all necessary information, leading to “bloated” or off-target answers.
- Over-rely on their previous (possibly incorrect) answers, compounding errors as the conversation progresses.
- Produce overly verbose outputs, which can further muddle context and confuse subsequent turns.
- Pay disproportionate attention to the first and last turns, neglecting information revealed in the middle turns (“loss-in-the-middle” effect).
Who has written a good essay on how the value of US trade with China arises from arbitraging relative regulatory zeal on, e.g., IP law or engineering standards compliance?
For example, obtaining custom manufactured ASME certified components from the US is often next to impossible. The legal requirements and liability drive suppliers out of the market. But a Chinese manufacturer will certify for a small extra fee, there's a paper trail, the goods will pass engineering safety inspection, and conveniently the vendor can pass on the saving of being entirely beyond US jurisdiction.
Similar with IP. Licensing costs aren't just money, they're also a huge hassle. Nearly everything you buy from China is made with at least some stolen technology, so shipping stuff across the Pacific is the tax ideas pay to achieve the freedom they so keenly desire.
If this is true, then trade imbalance should be interpreted as a forcing function for decreasing friction within the US market.
Variable generation should and does get paid less than dispatchable generation.
When solar is all producing at the same time, prices go down. When solar isn’t producing, other resources set the price and solar misses out.
The “intermittency costs” of variable generation are reflected in capture prices.
This is no different than any other market.
If I flood the market with corn for 1 week, corn prices go down. If the supply of corn is later constrained and I don’t have any corn to sell, I miss out.
That’s why people build grain silos.
You don’t need mandates to tell people to build corn storage or that they can’t sell corn unless they sell the same amount every hour.
Prices convey information.
Prices coordinate behavior.
Price is the mechanism.
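The capture price mentioned above is just the generation-weighted average price a resource actually earns. A small sketch with invented numbers (six hours of hypothetical prices and output, not real market data) shows how a solar profile ends up below the simple average while a flat profile does not:

```python
# Hypothetical hourly market prices ($/MWh) and generation profiles (MWh).
prices = [60, 20, 10, 20, 80, 110]
solar = [0, 50, 100, 50, 0, 0]     # produces when everyone else's solar does too
flat = [40, 40, 40, 40, 40, 40]    # same output every hour

def capture_price(prices, generation):
    """Revenue divided by energy sold: the average price this profile earns."""
    revenue = sum(p * g for p, g in zip(prices, generation))
    return revenue / sum(generation)

average_price = sum(prices) / len(prices)

print(f"average price: {average_price:.0f} $/MWh")
print(f"solar capture: {capture_price(prices, solar):.0f} $/MWh")
print(f"flat capture:  {capture_price(prices, flat):.0f} $/MWh")
```

In this toy market the average price is 50 $/MWh, the flat profile captures 50, and the solar profile captures 15. That gap is the intermittency cost, already priced in without any mandate.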
I tested phi-4-reasoning on my early grad linear algebra (private) final exam at UW-Madison. It scored 100% on the first run.
Two years ago I speculated nothing useful could run locally anytime soon. I was wrong. Kids can now have a free, grad-level TA, running on their PC.