Plugyawn @plugyawn - Twitter Profile

like, is gemini flash or whatever not system-prompted with "you are in a search-bar. the user has searched for <this term>. you are google search bar."

plugyawn's tweet photo: https://t.co/EPS654869y

1

0

36

Plugyawn

@plugyawn

about 2 hours ago

they named it Claude so it rhymes with God?

0

16

Plugyawn

@plugyawn

about 2 hours ago

why isn't the highlight clickable? why??

0

18

Plugyawn

@plugyawn

about 2 hours ago

Waiting for the day I type half the name of a paper in the search bar and my agent goes boss your phd friend sent you this on whatsapp a few weeks ago already.

0

9

Plugyawn

@plugyawn

about 5 hours ago

@kalomaze Yeah, that’s a reasonable point then. Reminded me of the fp32 deep linear nets article of old OpenAI fame.

0

34

Plugyawn

@plugyawn

about 20 hours ago

If you have, say, a blackbox O(context^2) op that underneath autoregressively evicts the current cache and reloads it with the post-write state (like a log), wouldn’t that be approximately indistinguishable from the write you mentioned? Assuming the KV cache is expressive, etc. I.e., somewhere in the data/chip/algo trilemma, one gives up the “quick write” for an opportunity to scale on the data?
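
A rough count of where the O(context^2) comes from, as a sketch: rewriting a window of C slots by generation takes C decode steps, and each step attends over at most C cached entries. The function name and the assumption that the window stays full throughout are illustrative, not taken from the thread.

```python
def rewrite_cost(context_length: int) -> tuple[int, int]:
    """Upper-bound cost of rewriting a full window by generation alone.

    Returns (decode_steps, attention_reads). Assumes (hypothetically) that
    every slot is regenerated and that the window is full at every step.
    """
    decode_steps = context_length                     # one new token per slot: O(C)
    attention_reads = decode_steps * context_length   # each token reads <= C entries: O(C^2)
    return decode_steps, attention_reads


for c in (10, 1_000, 100_000):
    steps, reads = rewrite_cost(c)
    print(f"C={c}: {steps} steps, up to {reads} attention reads")
```

So the "write" is linear in the number of generated tokens but quadratic in attention work, which is the trade the tweet describes.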

0

1

0

18

Plugyawn

@plugyawn

2 days ago

I used to think that the transformer isn't an NTM because it can't write to the KVCache, only read. but can't it technically generate its context-length worth of tokens to mutate the KVCache the way it wants? Isn't that just an expensive O(C) write? good sir @recurseparadox i remember you posted something about NTMs, i'm sure you have the answers i need, I'm not sure if I should be convinced by GPT's answer.

2

0

1

623

plugyawn retweeted

TracketPacer

@TracketPacer

1 day ago

68

25K

2K

860

319K

Plugyawn

@plugyawn

about 21 hours ago

@praveenvnktsh @_philschmid At the limit I suppose the model subsumes the harness.

0

1

0

11

plugyawn retweeted

messed up computers

@messeduppcs

3 days ago

122

33K

2K

1K

444K

Plugyawn

@plugyawn

1 day ago

@weeyev I think @RisingSayak could have some nice advice haha.

0

1

0

15

Plugyawn

@plugyawn

2 days ago

https://t.co/PuGjNCV73p fun lil thing v. far from usual, would love if someone could star/contribute!

2

3

1

2

207

Plugyawn

@plugyawn

1 day ago

@weeyev There's some good work on trainable log linear attention for DiTs this time at CVPR, i reckon that could be useful here too! https://t.co/dZS46mx5aW

Chen Change Loy

@ccloy

1 day ago

Proud to share our lab’s @MMLabNTU work Log-linear Sparse Attention (LLSA) - a trainable sparse attention mechanism that reduces attention complexity from O(N²) to O(N log N), making diffusion transformers much more efficient. Also, special shout-out to the first author @zhouyifan1107 for presenting the poster in full costume - truly above and beyond. The level of dedication is impressive! 👏 #CVPR2026 #DiffusionModels #EfficientAI #SparseAttention
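
To get a feel for the gap between the two complexity classes quoted above, here is a back-of-the-envelope comparison. It only counts N^2 against N log2 N and ignores constants; the log base and the function name are assumptions, and this is not an implementation of LLSA.

```python
import math


def attention_work(n: int) -> tuple[int, float]:
    """Return (dense, log_linear) operation counts for sequence length n."""
    dense = n * n                     # every token attends to every token
    log_linear = n * math.log2(n)     # assumed: about log2(n) attended blocks per token
    return dense, log_linear


for n in (1_024, 65_536):
    dense, log_linear = attention_work(n)
    print(f"N={n}: dense / log-linear ratio is about {dense / log_linear:.0f}x")
```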

14

796

85

266

95K

2

1

0

107

Plugyawn

@plugyawn

1 day ago

@weeyev Thanks @weeyev! DMs are open if you have any questions, would love some contributors. The aim is model-based RL!

1

0

19

Plugyawn

@plugyawn

1 day ago

Note that this is all K=1! On an ensemble, RandOpt is even stronger. If you want to try a fast implementation, check out: https://t.co/w1UdIjd7pf

plugyawn's tweet photo: https://t.co/gZDg441v2x

0

1

0

44

Plugyawn

@plugyawn

1 day ago

RandOpt is a great, cheap, fast bootstrapping method for raising your pass@K before you start your GRPO. Here's a 1.5B model gaining 7.3% on the hardest parts of the MATH datasets on an NVIDIA L4(!!) in just 40 minutes.
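
For readers who have not met pass@K: a minimal sketch of the widely used unbiased estimator from the Codex evaluation paper (Chen et al., 2021), assuming n samples per problem of which c are correct. The tweet does not say how its numbers were computed, so treat this as background rather than the author's evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: samples drawn per problem, c: how many of them were correct,
    k: the budget being scored (1 <= k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets containing no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=5, k=1))
```

With k = 1 the estimator reduces to the plain fraction of correct samples, c / n.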

plugyawn's tweet photo: https://t.co/CE0OKUuUB3

2

1

0

1

92

Plugyawn

@plugyawn

1 day ago

What if we use an L40 instead? By injecting structured low-rank noise so the noise-swapping process is amortized, RandOpt directly scales with inference speed. Here's a 16% raise in Countdown in just 200 seconds!
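
The tweet does not spell out how the noise is structured. One plausible reading, sketched here purely as an assumption, is a rank-1 perturbation W + sigma * u v^T: its effect on an input can be added to a cached W x without ever building the perturbed matrix, so trying a new noise sample costs O(m + n) instead of O(m * n). All names below are hypothetical and this is not the RandOpt implementation.

```python
import random


def matvec(w, x):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def perturbed_matvec(base_out, u, v, x, sigma):
    """Compute (W + sigma * u v^T) x from a cached base_out = W x."""
    scale = sigma * sum(vj * xj for vj, xj in zip(v, x))   # sigma * (v . x)
    return [b + scale * ui for b, ui in zip(base_out, u)]


rng = random.Random(0)
m, n, sigma = 4, 3, 0.1
w = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
x = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(m)]
v = [rng.gauss(0, 1) for _ in range(n)]

base = matvec(w, x)                              # computed once, shared by all noise samples
fast = perturbed_matvec(base, u, v, x, sigma)

# Reference path: materialize the perturbed matrix, then multiply.
w_noisy = [[w[i][j] + sigma * u[i] * v[j] for j in range(n)] for i in range(m)]
slow = matvec(w_noisy, x)

assert all(abs(a - b) < 1e-9 for a, b in zip(fast, slow))
```

Under this reading, swapping noise samples only means swapping the small vectors u and v, which is why the cost would track inference speed.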

plugyawn's tweet photo: https://t.co/soZ8jSPhhy

1

0

56

Plugyawn

@plugyawn

2 days ago

@_TarunKathuria I guess, the point was more that: most NTM/universal transformer-adjacent ideas employ hard-to-scale ideas like recursion to allow a write operation; but yesssss, i think most statements don't take into account the iterative nature of how most of these systems are deployed.

1

0

30

Plugyawn

@plugyawn

2 days ago

@_TarunKathuria could you pitch in too, is there a simpler argument to transformer expressivity i'm missing?

Plugyawn

@plugyawn

2 days ago

I used to think that the transformer isn't an NTM because it can't write to the KVCache, only read. but can't it technically generate its context-length worth of tokens to mutate the KVCache the way it wants? Isn't that just an expensive O(C) write? good sir @recurseparadox i remember you posted something about NTMs, i'm sure you have the answers i need, I'm not sure if I should be convinced by GPT's answer.

2

0

1

623

1

0

140

Plugyawn

@plugyawn

2 days ago

hmmmmmmmmmm i'd be inclined to think if you have bounded context length C (say, 10) and you have "abcde" in memory, then an "update 3 : c" rule (which would turn it into "abcce") can be simulated by pushing "abcde" out of context, "abcde00000" and then in-context copying "00000abcce", which to a coarse blackbox observer should be indistinguishable. of course you'd have to keep the command itself in memory, etc, etc.
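
The simulation described above can be sketched with a sliding window that only supports appends. This is a toy, assuming a 0-indexed update position, "0" as the pad token, and the update command held outside the window (the caveat the tweet itself raises); the function name is made up for illustration.

```python
from collections import deque


def simulate_update(memory, index, symbol, capacity=10, pad="0"):
    """Emulate `update index : symbol` using only appends to a bounded window.

    Returns (final_window, tokens_emitted).
    """
    if len(memory) > capacity:
        raise ValueError("memory does not fit in the context window")
    # deque(maxlen=...) drops the oldest token whenever a new one is appended
    # to a full window, like a bounded context.
    window = deque(memory, maxlen=capacity)
    emitted = 0

    # Step 1: pad until the window is full, e.g. "abcde" -> "abcde00000".
    while len(window) < capacity:
        window.append(pad)
        emitted += 1

    # Step 2: copy the memory back in, one token per step. The token being
    # copied is always the oldest one still in the window (window[0]),
    # and appending evicts it.
    for position in range(len(memory)):
        source = window[0]
        window.append(symbol if position == index else source)
        emitted += 1

    return "".join(window), emitted


print(simulate_update("abcde", 3, "c"))
```

The number of emitted tokens equals the capacity C, which is the "expensive O(C) write": an observer who only sees the final window cannot tell it apart from an in-place edit.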

0

1

0

52
