Akshit @akshitwt - Twitter Profile

Pinned Tweet

9 months ago

a skill that i am really proud of is my ability to iterate on experiments fast, and write "good" code. writing code is an important skill to have as a researcher, and in this post i discuss some tips to hopefully help you get better at it!

akshitwt's tweet photo. a skill that i am really proud of is my ability to iterate on experiments fast, and write "good" code.
writing code is an important skill to have as a researcher, and in this post i discuss some tips to hopefully help you get better at it! https://t.co/yo0LgNTQQ9

19

775

37

825

65K

Akshit

@akshitwt

about 6 hours ago

can someone please solve for agents explicitly putting 2025 in every search i ask them to do

0

1

0

128

Akshit

@akshitwt

about 17 hours ago

@shubham_tweeetz the last question by isaac isomov

0

2

Akshit

@akshitwt

1 day ago

the first 60 or so pages were so good... they could've done so much with the world.. it was so interesting.. only to throw it all away for 40 science theories being infodumped in the last 20 pages and for what

Phil Hoyeck

@PAHoyeck

2 days ago

Finally finished The Three Body Problem. Very sorry to my friends who like it, but I found it unfathomably bad in practically every way. Worst science fiction I've read in a long time. Won't be reading the sequels. 1.5 out of 5.

PAHoyeck's tweet photo. Finally finished The Three Body Problem. Very sorry to my friends who like it, but I found it unfathomably bad in practically every way. Worst science fiction I've read in a long time. Won't be reading the sequels. 1.5 out of 5. https://t.co/KP6AMESiqm

110

350

13

48

131K

1

8

0

1

775

Akshit

@akshitwt

about 21 hours ago

@kreepkroop i used to read so much my parents had to hide my books from me so i don't spend all my time reading lmao

2

5

0

215

akshitwt retweeted

clem 🤗

@ClementDelangue

6 days ago

Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea. Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get weird. Loss spikes for no reason. Eventually a shape-mismatch error. The culprit: every time you parse the model's output to detect a tool call, then re-tokenize the updated conversation for the next turn, you're rolling the dice. Usually the round-trip gives back the same tokens. Sometimes it doesn't and your gradient lands on a sequence the model never actually sampled. No crash. Just quietly wrong math and a useless gradient signal. The fix is one rule: never re-encode tokens you've decoded. Keep the sampled tokens in one buffer, never re-render them, and both failure modes disappear. That's Token-In, Token-Out done right. Our team just published a beautiful deep-dive on exactly this, including an audit across the major open-weights model families showing most chat templates already support it. Required reading if you're doing multi-turn RL 🤗🔥 https://t.co/zmx0EQl3jM

ClementDelangue's tweet photo. Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea.

Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get weird. Loss spikes for no reason. Eventually a shape-mismatch error.

The culprit: every time you parse the model's output to detect a tool call, then re-tokenize the updated conversation for the next turn, you're rolling the dice. Usually the round-trip gives back the same tokens. Sometimes it doesn't and your gradient lands on a sequence the model never actually sampled. No crash. Just quietly wrong math and a useless gradient signal.

The fix is one rule: never re-encode tokens you've decoded. Keep the sampled tokens in one buffer, never re-render them, and both failure modes disappear. That's Token-In, Token-Out done right.

Our team just published a beautiful deep-dive on exactly this, including an audit across the major open-weights model families showing most chat templates already support it. Required reading if you're doing multi-turn RL 🤗🔥

https://t.co/zmx0EQl3jM

49

1K

140

1K

1M

Akshit

@akshitwt

6 days ago

idk how people say they are doing research with agents... every single thing models have come up with is the most useless idea ive heard. still only good for implementation (that too requires so much babysitting to get correct) (yes this is with highest reasoning, etc)

akshitwt's tweet photo. idk how people say they are doing research with agents...
every single thing models have come up with is the most useless idea ive heard.

still only good for implementation (that too requires so much babysitting to get correct)

(yes this is with highest reasoning, etc) https://t.co/JzbtKRZa40

1

20

0

1

5K

Akshit

@akshitwt

6 days ago

@ShashwatGoel7 @leothecurious hopemaxxing

0

1

0

33

akshitwt retweeted

Proximal @ProximalHQ

6 days ago

We evaluated Claude Opus 4.8 on FrontierSWE ahead of today's release. It is now the best-performing model on FrontierSWE.

6

127

13

10

19K

Akshit

@akshitwt

7 days ago

@ShashwatGoel7 @leothecurious nah i was gonna propose this for my thesis ffs 😭😭😭

2

0

1

169

akshitwt retweeted

David Stutz @davidstutz92

9 days ago

Great take on the importance of evals as upstream of everything, including training oftentimes. I'd go one step further and say that proper evaluation is becoming and exciting standalone discipline within AI: https://t.co/OPAEEL4bps

2

33

2

35

7K

Akshit

@akshitwt

9 days ago

adding a very popular gem to my tech collection thanks to @Prolific! my dad will enjoy this a lot now that he spends his time toying with claude and codex :p Another perk of attending conferences like ICLR :)

akshitwt's tweet photo. adding a very popular gem to my tech collection thanks to @Prolific! my dad will enjoy this a lot now that he spends his time toying with claude and codex :p
Another perk of attending conferences like ICLR :) https://t.co/VesBTYtWuA

0

12

0

1

542

Akshit

@akshitwt

9 days ago

surprised that in all the math/ML courses ive taken this has never come up https://t.co/ZQV0fTNQnr

0

45

1

88

5K

Akshit

@akshitwt

9 days ago

thank you @natolambert 🙇

1

0

359

Akshit

@akshitwt

10 days ago

unbelievably good series on how traditional RL evolved into everything we see today!!! my rule of thumb for good educational content is combining technical details with the context/history surrounding them; it gives a much richer understanding of why things are the way they are

akshitwt's tweet photo. unbelievably good series on how traditional RL evolved into everything we see today!!!
my rule of thumb for good educational content is combining technical details with the context/history surrounding them; it gives a much richer understanding of why things are the way they are https://t.co/Lk9ij9ga0z

4

188

14

207

6K

Akshit

@akshitwt

10 days ago

@learning_pt

0

6

0

320

akshitwt retweeted

Ryan Bahlous-Boldi

@RyanBoldi

13 days ago

Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.

RyanBoldi's tweet photo. Your RL post-training may be sabotaging your LLM’s test-time scaling!

Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*.
We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.

35

845

120

783

210K

Akshit

@akshitwt

13 days ago

@ShashwatGoel7 is this only applicable for pretraining? or further training also

1

0

260

akshitwt retweeted

Maksym Andriushchenko

@maksym_andr

15 days ago

💥Today we release InferenceBench, our next benchmark after PostTrainBench that measures progress on AI R&D automation. AI R&D automation will very likely unfold gradually, starting from “boring” tasks like inference speed optimization that are very easily verifiable (accuracy + inference time). We show a rather negative result for current frontier agents. They are not good at system-level engineering and managing complex dependencies. They do show non-trivial performance, but they fail compared to a simple baseline: hyperparameter tuning of vLLM/SGLang hyperparameters. Importantly, InferenceBench tests *open-ended* inference optimization capabilities. This is different from more narrow benchmarks like KernelBench that only let agents optimize kernels (which is a very valuable task, too!). The benchmark is intentionally open-ended, so the poor performance of the agents is not an underelicitation issue. The agents have everything needed to succeed, but they still fail because they are not yet reliable enough for this task. Our results suggest an inverse scaling phenomenon: Claude Sonnet 4.6 and GLM-5 rank highly because they more often preserve simple, valid, high-performing final servers, while several larger models show stronger peak runs but lose utility through brittle final-state choices. This contrasts with benchmarks where rankings track raw capability (e.g., SWE-Bench, Terminal-Bench, PostTrainBench, FrontierSWE). One of the primary bottlenecks we have clearly observed is the lack of diversity of strategies: nearly all agents just use vLLM, without exploring alternatives. Overall, proper exploration is lacking: the current agents are not ready to tackle broad enough goals and get stuck after the first found solution (such as vLLM). I’m sure future agents will do much better, but here is where we are now. This benchmark is our 2nd one in a suite of benchmarks that will track the progress on AI R&D automation. We will develop many more benchmarks that will cover different aspects of AI R&D automation, culminating in recursive self-improvement. Stay tuned!

maksym_andr's tweet photo. 💥Today we release InferenceBench, our next benchmark after PostTrainBench that measures progress on AI R&D automation.

AI R&D automation will very likely unfold gradually, starting from “boring” tasks like inference speed optimization that are very easily verifiable (accuracy + inference time). We show a rather negative result for current frontier agents. They are not good at system-level engineering and managing complex dependencies. They do show non-trivial performance, but they fail compared to a simple baseline: hyperparameter tuning of vLLM/SGLang hyperparameters.

Importantly, InferenceBench tests *open-ended* inference optimization capabilities. This is different from more narrow benchmarks like KernelBench that only let agents optimize kernels (which is a very valuable task, too!). The benchmark is intentionally open-ended, so the poor performance of the agents is not an underelicitation issue. The agents have everything needed to succeed, but they still fail because they are not yet reliable enough for this task.

Our results suggest an inverse scaling phenomenon: Claude Sonnet 4.6 and GLM-5 rank highly because they more often preserve simple, valid, high-performing final servers, while several larger models show stronger peak runs but lose utility through brittle final-state choices. This contrasts with benchmarks where rankings track raw capability (e.g., SWE-Bench, Terminal-Bench, PostTrainBench, FrontierSWE).

One of the primary bottlenecks we have clearly observed is the lack of diversity of strategies: nearly all agents just use vLLM, without exploring alternatives. Overall, proper exploration is lacking: the current agents are not ready to tackle broad enough goals and get stuck after the first found solution (such as vLLM). I’m sure future agents will do much better, but here is where we are now.

This benchmark is our 2nd one in a suite of benchmarks that will track the progress on AI R&D automation. We will develop many more benchmarks that will cover different aspects of AI R&D automation, culminating in recursive self-improvement. Stay tuned!

12

346

48

212

42K

Akshit

@akshitwt

Last Seen Users on Sotwe

Trends for you

Most Popular Users