Caden Juang @kh4dien - Twitter Profile

Pinned Tweet

5 months ago

Last week's Boston Systems Reading Group kicked off with containers! We read through @b0rk’s zine on the subject and @ekzhang1’s SSH hypervisor came in handy for those of us experimenting with containers on Mac. https://t.co/IGMXzt7bJg

1

9

1

999

kh4dien retweeted

Christopher Potts

@ChrisGPotts

17 days ago

In honor of this paper's acknowledgments section:

0

34

7

10

7K

kh4dien retweeted

Intology

@intology

26 days ago

Can coding agents do research?

We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D problem with months of human progress

Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research

NanoGPT-Bench is built on the NanoGPT Speedrun, a popular LLM pretraining competition to minimize the training time of a GPT-2 style model. Existing human submissions constitute nearly 2 years of work. To control for dependencies and contamination in frontier models, we standardize evaluation to a 5-month window of world records. Evaluation is fully autonomous and end-to-end, with no human intervention or internet access. 🧵

22

281

61

173

145K

kh4dien retweeted

Lawrence Chan

@justanotherlaw

about 1 month ago

A recent viral paper claims to reverse-engineer the parameter counts of frontier models: GPT-5.5 = 9.7T, Opus 4.7 = 4.0T, o1 = 3.5T, etc.

@ben_sturgeon and I investigated and found serious issues in the paper; fixing them gives GPT-5.5 as ~1.5T (90% CI: 256B-8.3T). https://t.co/MbWQyVlmsE

29

953

95

419

210K

kh4dien retweeted

John Yang

@jyangballin

4 months ago

Across all mini-SWE-agent + <model> runs, SWE-bench Verified's current "ceiling"? - 87.4%

(0.874 - 0.8) * 500 = another *37* instances that aren't solved consistently.

If you recalculate this number across all official SWE-bench Verified submissions? - 95%

from SWE-bench site https://t.co/ITcULTs1W5

6

49

9

14

22K

Caden Juang

@kh4dien

2 months ago

@leonardtang_ Should check out Docent! (https://t.co/PmmLMoaVoZ)

1

4

0

1

643

kh4dien retweeted

sandrone

@kosenjuu

3 months ago

Finally, Astral Codex

16

1K

31

37

52K

Caden Juang

@kh4dien

3 months ago

@ellev3n11 🍕

0

1

0

66

kh4dien retweeted

Jaden Fiotto-Kaufman @jadenfk23

4 months ago

NNsight 0.6 is out now! We directly address your feedback in our biggest release yet. Pain points included cryptic errors, slow traces, no remote execution of custom code, and limited vLLM support. We tackle all of these and more in this new release. 🧵 Here's what changed:

1

38

10

17

8K

Caden Juang

@kh4dien

4 months ago

Achyuta Rajaram @AchyutaBot

4 months ago

AchyutaBot's tweet photo. https://t.co/OvebvXO9LV

2

23

0

1

3K

2

11

0

3K

Caden Juang

@kh4dien

4 months ago

@simonw Some recent work on the Codex 5 > 5.1 difference on TerminalBench: https://t.co/JZ04lRjoer

0

50

Caden Juang

@kh4dien

4 months ago

@simonw
> It's interesting to see Claude Opus 4.5 beat Opus 4.6
There's a lot of nuance not reflected in top-level numbers, definitely check out the transcripts. The leaderboard links to some great tools for exploring them!

1

3

0

418

Caden Juang

@kh4dien

4 months ago

@uzpg_ https://t.co/PzaeRInqru

0

3

0

3

97

kh4dien retweeted

Grace Luo @graceluo_

4 months ago

We trained diffusion models on a billion LLM activations, and we want you to use them!
New preprint: Learning a Generative Meta-Model of LLM Activations
Joint work with @feng_jiahai, @trevordarrell, @AlecRad, @JacobSteinhardt.
More in thread 🧵

32

1K

192

1K

222K

Caden Juang

@kh4dien

4 months ago

@corefpark @moltbook Cool! Did you scrape the posts / are they open source?

1

0

82

kh4dien retweeted

Core Francisco Park

@corefpark

4 months ago

@moltbook scaling to 50k posts in a day really made me think when these things come online, we don't have direct actionable oversight... I did the trivial thing. Embed and umap the posts: https://t.co/yaMZu8lrYt

1

13

2

3

985

Caden Juang

@kh4dien

5 months ago

Some resources from the first group:
Julia Evans’s zine: https://t.co/8HryfTqBcb
Eric Zhang’s SSH hypervisor: https://t.co/ys61lspuua
And more: https://t.co/mxc8R0NJL8

0

1

0

174

Caden Juang

@kh4dien

5 months ago

Unfortunately, this week’s reading group was postponed due to a winter storm, but come to the Boston Public Library next Sunday at 11am to read about k8s and other container orchestration systems! https://t.co/mxc8R0NJL8

1

0

264

kh4dien retweeted

Caden Juang

@kh4dien

5 months ago

Starting a computer systems reading group in Boston! We'll meet weekly to explore databases, networking, compilers, distributed systems, and more. Our first meetup is next Sunday 1/18 at 11 AM. We’ll be talking about containers. All backgrounds welcome! 👇

1

3

1

0

409
