If you, like me, just woke up, let me catch you up on the Claude Code Leak (I know nothing, all conjecture):
> Someone inside Anthropic got switched to Adaptive reasoning mode
> Their Claude Code switched to Sonnet
> Committed the .map file of Claude Code
> Effectively leaking the ENTIRE CC Source Code
> @realsigridjin was tired after running 2 south korean hackathons in SF, saw the leak
> Rules in Korea are different, he cloned the repo, went to sleep
> Wakes up to 25K stars, and his GF begging him to take it down (she's a copyright lawyer)
> Their team decided - how about we have agents rewrite this in Python!? Surely... this is more legal
> Rewrite in Py
> Board a plane to SK🇰🇷
> One of the guys decides python is slow, is now rewriting ALL OF CLAUDE CODE into Rust.
> Anthropic cannot take down, cannot sue
> Is this "fair use?"
> TL;DR - we're about to have open source Claude Code in Rust
Software horror: litellm PyPI supply chain attack.
Simple `pip install litellm` was enough to exfiltrate SSH keys, AWS/GCP/Azure creds, Kubernetes configs, git credentials, env vars (all your API keys), shell history, crypto wallets, SSL private keys, CI/CD secrets, database passwords.
LiteLLM itself has 97 million downloads per month which is already terrible, but much worse, the contagion spreads to any project that depends on litellm. For example, if you did `pip install dspy` (which depended on litellm>=1.64.0), you'd also be pwnd. Same for any other large project that depended on litellm.
Afaict the poisoned version was up for less than ~1 hour. The attack had a bug which led to its discovery: Callum McMahon was using an MCP plugin inside Cursor that pulled in litellm as a transitive dependency. When litellm 1.82.8 installed, their machine ran out of RAM and crashed. So if the attacker hadn't vibe coded this attack, it could have gone undetected for many days or weeks.
Supply chain attacks like this are basically the scariest thing imaginable in modern software. Every time you install any dependency you could be pulling in a poisoned package anywhere deep inside its entire dependency tree. This is especially risky with large projects that might have lots and lots of dependencies. The credentials that do get stolen in each attack can then be used to take over more accounts and compromise more packages.
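To make the "anywhere deep inside its dependency tree" point concrete, here is a minimal sketch that lists every installed package depending on a given one, directly or transitively. It only reads local package metadata via the standard library's `importlib.metadata`; the helper names (`requirement_name`, `reverse_dependents`, `installed_requirements`) are made up for illustration, and it counts optional extras as dependencies, so it will over-report somewhat.

```python
import re
from importlib import metadata


def requirement_name(requirement: str) -> str:
    """Extract the normalized project name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    name = match.group(1) if match else ""
    # PEP 503 style normalization: runs of '-', '_', '.' become one hyphen.
    return re.sub(r"[-_.]+", "-", name).lower()


def reverse_dependents(target: str, requires_by_dist: dict[str, list[str]]) -> set[str]:
    """Return every distribution that requires `target`, directly or transitively."""
    # Build reverse edges: dependency name -> packages that declare it.
    parents: dict[str, set[str]] = {}
    for dist, reqs in requires_by_dist.items():
        for req in reqs:
            parents.setdefault(requirement_name(req), set()).add(requirement_name(dist))

    found: set[str] = set()
    stack = [requirement_name(target)]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def installed_requirements() -> dict[str, list[str]]:
    """Map each installed distribution to its declared requirement strings."""
    result: dict[str, list[str]] = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            result[name] = list(dist.requires or [])
    return result


if __name__ == "__main__":
    # Which packages in this environment would have pulled in litellm?
    print(sorted(reverse_dependents("litellm", installed_requirements())))
```

This only tells you who is exposed in the current environment. It does not check versions or detect a poisoned release.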
Classical software engineering would have you believe that dependencies are good (we're building pyramids from bricks), but imo this has to be re-evaluated, and it's why I've become increasingly averse to them, preferring to use LLMs to "yoink" functionality when it's simple enough and possible.
Just read Karpathy’s nanochat experiment with an “8-agent research org.” It looks beautiful, but the takeaway is brutal: agents don’t fail at execution, they fail at research.
They implement well, but by default they don’t set strong baselines, don’t do proper ablations, don’t control compute/time, and don’t design experiments. You end up with outputs that look like discoveries but are mostly noise.
I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :)
I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). Research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, see their individual work, and "take over" if needed, i.e. no -p.
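A rough sketch of the branch-plus-worktree layout described above, assuming one feature branch and one worktree directory per agent. The naming scheme (`<program>-<agent>`, a `worktrees/` root) is an assumption for illustration, not the actual setup; the function only builds the `git worktree add -b <branch> <path> <base>` commands and does not run them.

```python
def worktree_commands(
    program_branch: str,
    agents: list[str],
    root: str = "worktrees",  # hypothetical directory holding all agent checkouts
) -> list[list[str]]:
    """Build one `git worktree add` command per agent, forking the program branch."""
    commands = []
    for agent in agents:
        # Hyphen, not slash: git cannot hold both `softcap` and `softcap/agent` as branches.
        feature_branch = f"{program_branch}-{agent}"
        path = f"{root}/{feature_branch}"
        commands.append(
            ["git", "worktree", "add", "-b", feature_branch, path, program_branch]
        )
    return commands


if __name__ == "__main__":
    agents = [f"claude-{i}" for i in range(1, 5)] + [f"codex-{i}" for i in range(1, 5)]
    for cmd in worktree_commands("softcap", agents):
        print(" ".join(cmd))
```

Each agent then works in its own directory on its own branch, so the isolation comes from separate checkouts rather than from containers.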
But ok the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at highest intelligence. They don't think carefully through experiment design, they run somewhat nonsensical variations, they don't create strong baselines and ablate things properly, they don't carefully control for runtime or flops. (Just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result: a bigger network will have a lower validation loss in the infinite data regime, but it also trains for a lot longer. It's not clear why I had to come in to point that out.) They are very good at implementing any given well-scoped and described idea but they don't creatively generate them.
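The hidden-size example is a compute-control problem, and a small worked sketch shows what a fair comparison could look like. It assumes the common rule of thumb that training a dense transformer costs roughly 6 × parameters × tokens FLOPs; that approximation and the parameter counts below are illustrative assumptions, not nanochat's actual accounting.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


def tokens_for_budget(n_params: float, flop_budget: float) -> float:
    """Tokens a model of this size may see so it spends exactly the given budget."""
    return flop_budget / (6.0 * n_params)


if __name__ == "__main__":
    # Hypothetical baseline: 100M parameters trained on 2B tokens.
    budget = training_flops(100e6, 2e9)

    # A wider 200M-parameter variant gets the same FLOPs, so half the tokens.
    wider_tokens = tokens_for_budget(200e6, budget)
    print(f"budget: {budget:.2e} FLOPs, wider model tokens: {wider_tokens:.2e}")
```

If the wider model still wins on validation loss under the matched budget, the result is worth a closer look. If it only wins when given more compute, it is the spurious kind described above.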
But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, etc. and processes that make it up. E.g. a daily standup in the morning is now part of the "org code". And optimizing nanochat pretraining is just one of the many tasks (almost like an eval). Then - given an arbitrary task, how quickly does your research org generate progress on it?
The real direction probably isn’t adding more agents, but turning the research process itself into org code: task definitions, constraints, control groups, budgets, stop conditions, postmortems, audit logs.
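One way to picture "research process as org code" is a spec that an agent must fill in before it gets a GPU. This is a hypothetical sketch: the `ExperimentSpec` fields and the `missing_controls` check are invented here to illustrate the list above (task definition, control group, budget, stop conditions), not taken from any existing setup.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentSpec:
    """Hypothetical pre-registration record for a single agent experiment."""
    task: str
    hypothesis: str
    baseline_run: str  # the control the result will be compared against
    flop_budget: float
    max_wall_clock_minutes: int
    stop_conditions: list[str] = field(default_factory=list)


def missing_controls(spec: ExperimentSpec) -> list[str]:
    """Return the reasons this experiment should not be launched yet."""
    problems = []
    if not spec.baseline_run.strip():
        problems.append("no baseline run named")
    if spec.flop_budget <= 0:
        problems.append("no compute budget set")
    if spec.max_wall_clock_minutes <= 0:
        problems.append("no wall-clock limit set")
    if not spec.stop_conditions:
        problems.append("no stop conditions listed")
    return problems


if __name__ == "__main__":
    spec = ExperimentSpec(
        task="delete logit softcap without regression",
        hypothesis="softcap is unnecessary once training is stable",
        baseline_run="",
        flop_budget=1.2e18,
        max_wall_clock_minutes=60,
    )
    print(missing_controls(spec))
```

The gate is deliberately dumb: it cannot judge whether an idea is good, only whether the experiment was set up so that its result can be trusted.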
Fine-tuning just got a whole lot easier.
Serverless SFT is now in public preview on W&B!
Managed infrastructure (powered by @CoreWeave) that auto-scales to your training workloads. No cluster setup. No idle GPU costs.
Day 0 of my 1-bit / CPU experiment.
Ran the reference demo on a MacBook. So far the outputs look like this 👇
Goal for this series:
keep 1-bit / CPU-friendly inference,
make the outputs actually useful.
I ran the reference 1-bit demo.
The outputs are hilariously bad.
Now I’m building a more serious 1-bit stack on my side: better architecture + training, same CPU-friendly inference.
Goal: show that “1-bit” doesn’t have to mean “garbage outputs”.
Don’t think this “kills the GPU mafia” at all.
1-bit / BitNet makes CPU inference cheaper, so we can ship more AI features and justify more spend on big GPU training.
From an infra POV, that’s a demand amplifier, not a GPU funeral.
Saving this to try later.
Microsoft killed the GPU mafia 🤯
They finally open-sourced their 1-bit LLM inference framework called bitnet.cpp. It lets you run 100B parameter models on your local CPU without GPUs.
- 6.17x faster inference
- 82.2% less energy on CPUs
100% Open Source.
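For anyone wondering what "1-bit" weights mean in practice, here is a minimal sketch of ternary weight quantization, assuming the absmean scheme described for BitNet b1.58 (scale by the mean absolute weight, round, clip to -1, 0, 1). It is a toy in plain Python and says nothing about how bitnet.cpp's kernels are implemented.

```python
def absmean_ternary(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Quantize weights to {-1, 0, 1} and return them with the scale used."""
    if not weights:
        raise ValueError("weights must not be empty")
    # Scale is the mean absolute value; eps avoids division by zero.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round each scaled weight to the nearest integer, then clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate weights from ternary values and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    q, s = absmean_ternary([0.9, -0.05, 0.4, -1.2])
    print(q, round(s, 4))
    print(dequantize(q, s))
```

With every weight in {-1, 0, 1}, a matrix multiply reduces to additions and subtractions plus one scale per tensor, which is the sort of work CPUs handle well.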
Also, I'm genuinely interested in this direction.
If you see any serious write-ups, papers, or people building real systems around 1-bit / CPU inference (model + infra), please send them my way.
SoftBank reportedly discussing up to $30B more into OpenAI.
If the broader round gets to ~$100B, this isn’t “funding news”, it’s a pricing + distribution war declaration.
Here are the only 3 things that matter for founders:
1) Balance sheets can bend the market.
When someone can subsidize inference and bundle everything into one contract, “better product” stops being the main advantage. You’re competing against a temporary distortion, not just a model.
2) Moats are shifting from models to control points.
Model quality will converge. The winners sit where switching hurts: procurement, workflow embedding, data rights, compliance, feedback loops.
3) If you’re building a thin wrapper, you’re on borrowed time.
The surviving strategy is to own a bottleneck the platform can’t easily bundle away:
a regulated workflow, a proprietary dataset with clean rights, or a distribution wedge.
My take: the game is moving from “who’s smartest” to who owns the choke point.