Richard Song @resongy - Twitter Profile

Pinned Tweet

Richard Song

@ReSongy

7 months ago

Accidentally made a podcast. Model drift is real — I no longer know what I’m fine-tuning myself into.

2

0

295

ReSongy retweeted

Alex Volkov

@altryne

3 months ago

If you, like me, just woke up, let me catch you up on the Claude Code Leak (I know nothing, all conjecture):
> Someone inside Anthropic, got switched to Adaptive reasoning mode
> Their Claude Code switched to Sonnet
> Committed the .map file of Claude Code
> Effectively leaking the ENTIRE CC Source Code
> @realsigridjin was tired after running 2 south korean hackathons in SF, saw the leak
> Rules in Korea are different, he cloned the repo, went to sleep
> Wakes up to 25K stars, and his GF begging him to take it down (she's a copyright lawyer)
> Their team decided - how about we have agents rewrite this in Python!? Surely... this is more legal
> Rewrite in Py
> Board a plane to SK🇰🇷
> One of the guys decides python is slow, is now rewriting ALL OF CLAUDE CODE into Rust.
> Anthropic cannot take down, cannot sue
> Is this "fair use?"
> TL;DR - we're about to have open source Claude Code in Rust

345

12K

1K

6K

2M

Richard Song

@ReSongy

3 months ago

Sharing this with my Chinese-speaking circle. Everyone, take note.

Andrej Karpathy

@karpathy

3 months ago

Software horror: litellm PyPI supply chain attack. Simple `pip install litellm` was enough to exfiltrate SSH keys, AWS/GCP/Azure creds, Kubernetes configs, git credentials, env vars (all your API keys), shell history, crypto wallets, SSL private keys, CI/CD secrets, database passwords. LiteLLM itself has 97 million downloads per month which is already terrible, but much worse, the contagion spreads to any project that depends on litellm. For example, if you did `pip install dspy` (which depended on litellm>=1.64.0), you'd also be pwnd. Same for any other large project that depended on litellm. Afaict the poisoned version was up for less than ~1 hour. The attack had a bug which led to its discovery - Callum McMahon was using an MCP plugin inside Cursor that pulled in litellm as a transitive dependency. When litellm 1.82.8 installed, their machine ran out of RAM and crashed. So if the attacker didn't vibe code this attack it could have been undetected for many days or weeks. Supply chain attacks like this are basically the scariest thing imaginable in modern software. Every time you install any dependency you could be pulling in a poisoned package anywhere deep inside its entire dependency tree. This is especially risky with large projects that might have lots and lots of dependencies. The credentials that do get stolen in each attack can then be used to take over more accounts and compromise more packages. Classical software engineering would have you believe that dependencies are good (we're building pyramids from bricks), but imo this has to be re-evaluated, and it's why I've been so growingly averse to them, preferring to use LLMs to "yoink" functionality when it's simple enough and possible.

1K

28K

5K

14K

67M

0

1

0

39
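To make the "anywhere deep inside its entire dependency tree" point concrete, here is a minimal sketch that walks the declared requirements of a package and collects everything it pulls in transitively. It uses the standard library's `importlib.metadata.requires`. The helper names (`requirement_name`, `installed_requires`, `transitive_dependencies`) and the demo graph are illustrative assumptions, not anything from the tweet, apart from the `dspy` to `litellm>=1.64.0` edge it mentions. The walk over-approximates, since it also follows optional extras.

```python
import re
from importlib import metadata
from typing import Callable, Iterable, Set


def requirement_name(req: str) -> str:
    """Extract and normalize the distribution name from a requirement string."""
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", req.strip())
    if not match:
        return ""
    # Treat runs of '-', '_' and '.' as equivalent and compare case-insensitively.
    return re.sub(r"[-_.]+", "-", match.group(0)).lower()


def installed_requires(name: str) -> Iterable[str]:
    """Declared requirements of an installed distribution (empty if not installed)."""
    try:
        return metadata.requires(name) or []
    except metadata.PackageNotFoundError:
        return []


def transitive_dependencies(
    root: str,
    requires_fn: Callable[[str], Iterable[str]] = installed_requires,
) -> Set[str]:
    """Collect every distribution reachable from `root` via declared requirements."""
    root_name = requirement_name(root)
    seen: Set[str] = set()
    stack = [root_name]
    while stack:
        current = stack.pop()
        for req in requires_fn(current):
            dep = requirement_name(req)
            if dep and dep != root_name and dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


# Hypothetical graph for illustration; only the dspy -> litellm edge is from the tweet.
HYPOTHETICAL_GRAPH = {
    "dspy": ["litellm>=1.64.0"],
    "litellm": ["example-http-client", "example-tokenizer>=0.5"],
    "example-http-client": ["example-tls"],
}

deps = transitive_dependencies("dspy", lambda n: HYPOTHETICAL_GRAPH.get(n, []))
print(sorted(deps))
```

Calling `transitive_dependencies("some-package")` without a second argument inspects whatever is installed in the current environment, which gives a rough sense of how many packages a single install can bring in.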

Richard Song

@ReSongy

4 months ago

Treat every run as an eval that decides whether a branch gets merged or killed. Without an audit layer, agents are just burning GPU to hallucinate.

0

28
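One way to read "every run is an eval that decides merge or kill" is as a small gate function that compares a run against its baseline and writes the decision to an audit log. The sketch below is only an illustration under assumptions: the `RunResult` fields, the `min_improvement` threshold, and the JSON log format are all hypothetical choices, not a description of any existing tool.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class RunResult:
    branch: str               # git branch holding the experiment
    val_loss: float           # metric from this run (lower is better)
    baseline_val_loss: float  # same metric from the control run
    gpu_hours: float          # compute spent on this run


def gate(
    run: RunResult,
    min_improvement: float = 0.01,
    audit_log: Optional[List[str]] = None,
) -> str:
    """Return "merge" or "kill" and record why in the audit log."""
    improvement = run.baseline_val_loss - run.val_loss
    decision = "merge" if improvement >= min_improvement else "kill"
    if audit_log is not None:
        record = {**asdict(run), "improvement": improvement, "decision": decision}
        audit_log.append(json.dumps(record))
    return decision


log: List[str] = []
print(gate(RunResult("exp/a", 3.15, 3.20, 1.0), audit_log=log))
print(gate(RunResult("exp/b", 3.199, 3.20, 1.0), audit_log=log))
```

The point of the log is that every merge or kill can be traced back to the numbers that justified it.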

Richard Song

@ReSongy

4 months ago

Just read Karpathy’s nanochat experiment with an “8-agent research org.” It looks beautiful, but the takeaway is brutal: agents don’t fail at execution, they fail at research. They implement well, but by default they don’t set strong baselines, don’t do proper ablations, don’t control compute/time, and don’t design experiments. You end up with outputs that look like discoveries but are mostly noise.

Andrej Karpathy

@karpathy

4 months ago

I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :) I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). Research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, see their individual work, and "take over" if needed, i.e. no -p. But ok the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at highest intelligence. They don't think carefully through experiment design, they run somewhat nonsensical variations, they don't create strong baselines and ablate things properly, they don't carefully control for runtime or flops. (just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result given that a bigger network will have a lower validation loss in the infinite data regime, but then it also trains for a lot longer, it's not clear why I had to come in to point that out). They are very good at implementing any given well-scoped and described idea but they don't creatively generate them. But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, etc. and processes that make it up. E.g. a daily standup in the morning is now part of the "org code".
And optimizing nanochat pretraining is just one of the many tasks (almost like an eval). Then - given an arbitrary task, how quickly does your research org generate progress on it?

561

9K

794

7K

2M

1

0

108
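The hidden-size example in the quoted tweet is a compute-control problem: the wider model also consumed more training compute, so the comparison was not like for like. A minimal sketch of such a check follows. It assumes the common rule of thumb that training FLOPs are roughly 6 × parameters × tokens, which is not stated in the tweet. The `Run` class, the 5% tolerance, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    n_params: int     # trainable parameters
    tokens_seen: int  # training tokens consumed
    val_loss: float

    @property
    def train_flops(self) -> int:
        # Rule-of-thumb estimate (an assumption): ~6 FLOPs per parameter per token.
        return 6 * self.n_params * self.tokens_seen


def compute_matched(a: Run, b: Run, rel_tol: float = 0.05) -> bool:
    """True if the two runs used roughly the same estimated training compute."""
    larger = max(a.train_flops, b.train_flops)
    return abs(a.train_flops - b.train_flops) <= rel_tol * larger


def tokens_for_budget(flops_budget: int, n_params: int) -> int:
    """Tokens a model of this size may train on to stay within the FLOPs budget."""
    return flops_budget // (6 * n_params)


# Hypothetical numbers: the wide model sees the same tokens, so it uses 2x the compute.
baseline = Run("baseline", 100_000_000, 2_000_000_000, 3.30)
wide = Run("wide", 200_000_000, 2_000_000_000, 3.20)

print(compute_matched(baseline, wide))
print(tokens_for_budget(baseline.train_flops, wide.n_params))
```

A lower loss from `wide` only counts as a finding if it survives retraining at the token count returned by `tokens_for_budget`.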

Richard Song

@ReSongy

4 months ago

The real direction probably isn’t adding more agents, but turning the research process itself into org code: task definitions, constraints, control groups, budgets, stop conditions, postmortems, audit logs.

1

0

31
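As a rough illustration of what "org code" could look like, the sketch below writes a task definition, constraints, a control group, a budget, and stop conditions as plain data, with every stop check recorded in an audit log. All class names, field names, and thresholds are hypothetical. The task string borrows the logit-softcap example from the nanochat tweet.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ExperimentSpec:
    task: str               # what the agent is asked to achieve
    constraints: List[str]  # things the agent must not change
    control_group: str      # branch or run to compare against
    gpu_hour_budget: float  # hard compute budget
    patience: int           # evals without improvement before stopping


@dataclass
class ExperimentState:
    gpu_hours_used: float = 0.0
    evals_without_improvement: int = 0
    audit_log: List[str] = field(default_factory=list)


def stop_reason(spec: ExperimentSpec, state: ExperimentState) -> Optional[str]:
    """Return why the experiment should stop, or None to continue; log either way."""
    if state.gpu_hours_used >= spec.gpu_hour_budget:
        reason: Optional[str] = "budget_exhausted"
    elif state.evals_without_improvement >= spec.patience:
        reason = "no_progress"
    else:
        reason = None
    state.audit_log.append(f"stop_check task={spec.task!r} result={reason or 'continue'}")
    return reason


spec = ExperimentSpec(
    task="delete logit softcap without regression",
    constraints=["same data", "same tokenizer"],
    control_group="main",
    gpu_hour_budget=8.0,
    patience=3,
)
state = ExperimentState(gpu_hours_used=2.0, evals_without_improvement=3)
print(stop_reason(spec, state))
```

Keeping the spec frozen means an agent cannot quietly widen its own budget mid-run. Any change has to arrive as a new spec, which a postmortem can then compare against the old one.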

ReSongy retweeted

Weights & Biases

@wandb

4 months ago

Fine-tuning just got a whole lot easier. Serverless SFT is now in public preview on W&B! Managed infrastructure (powered by @CoreWeave) that auto-scales to your training workloads. No cluster setup. No idle GPU costs.

5

172

21

35

251K

Richard Song

@ReSongy

5 months ago

Automatically captures and re-injects coding context for Claude Code sessions https://t.co/dj7AtA9JKU

0

1

0

104

Richard Song

@ReSongy

5 months ago

Open native multimodal agentic model with thinking + instant modes https://t.co/53XIc3wPHR

0

79

Richard Song

@ReSongy

5 months ago

Realtime speech-to-speech with voice + persona control https://t.co/N1ZtTAyxRQ

0

1

106

Richard Song

@ReSongy

5 months ago

Day 0 of my 1-bit / CPU experiment. Ran the reference demo on a MacBook; so far the outputs look like this 👇 Goal for this series: keep 1-bit / CPU-friendly inference, but make the outputs actually useful.

Richard Song

@ReSongy

5 months ago

I ran the reference 1-bit demo. The outputs are hilariously bad. Now I’m building a more serious 1-bit stack on my side: better architecture + training, same CPU-friendly inference. Goal: show that “1-bit” doesn’t have to mean “garbage outputs”. https://t.co/jaG1bHpbx3

1

0

141

1

2

0

120

Richard Song

@ReSongy

5 months ago

Btw, this is running on a MacBook.

0

1

0

45

Richard Song

@ReSongy

5 months ago

Don’t think this “kills the GPU mafia” at all. 1-bit / BitNet makes CPU inference cheaper, so we can ship more AI features and justify more spend on big GPU training. From an infra POV, that’s a demand amplifier, not a GPU funeral. Saving this to try later.

Oliver Prompts

@oliviscusAI

5 months ago

Microsoft killed the GPU mafia 🤯 They finally open-sourced their 1-bit LLM inference framework called bitnet.cpp. It lets you run 100B parameter models on your local CPU without GPUs.
- 6.17x faster inference
- 82.2% less energy on CPUs
100% Open Source.

550

16K

2K

14K

2M

2

1

0

161

Richard Song

@ReSongy

5 months ago

Also, I'm genuinely interested in this direction. If you see any serious write-ups, papers, or people building real systems around 1-bit / CPU inference (model + infra), please send them my way.

0

1

0

30

Richard Song

@ReSongy

5 months ago

I ran the reference 1-bit demo. The outputs are hilariously bad. Now I’m building a more serious 1-bit stack on my side: better architecture + training, same CPU-friendly inference. Goal: show that “1-bit” doesn’t have to mean “garbage outputs”.

1

0

141

Richard Song

@ReSongy

5 months ago

Robots don’t need more parameters, they need better physics. Been building world-models this whole time — cool to see NVIDIA leaning in too.

0

1

0

43

Richard Song

@ReSongy

5 months ago

2) Moats are shifting from models to control points. Model quality will converge. The winners sit where switching hurts: procurement, workflow embedding, data rights, compliance, feedback loops. 3) If you’re building a thin wrapper, you’re on borrowed time. The surviving strategy is to own a bottleneck the platform can’t easily bundle away: a regulated workflow, a proprietary dataset with clean rights, or a distribution wedge. My take: the game is moving from “who’s smartest” to who owns the choke point.

0

40

Richard Song

@ReSongy

5 months ago

SoftBank reportedly discussing up to $30B more into OpenAI. If the broader round gets to ~$100B, this isn’t “funding news” — it’s a pricing + distribution war declaration. Here are the only 3 things that matter for founders: 1) Balance sheets can bend the market. When someone can subsidize inference and bundle everything into one contract, “better product” stops being the main advantage. You’re competing against a temporary distortion — not just a model.