Jack @JackNotOld - Twitter Profile

Pinned Tweet

10 days ago

Put out my first LessWrong blog post!

Interpretability treats steering directions like "control knobs". I checked whether that assumption is mathematically valid across 8 different models.

At α = 1, it breaks in 92% of cases.

https://t.co/esn7gJZ0wM https://t.co/WU7tll2l9R

0

1

0

125

Jack @JackNotOld

about 3 hours ago

The @ElevenLabs robot is serving coffee in SoHo.

0

21

JackNotOld retweeted

elie

@eliebakouch

3 days ago

WOW, Microsoft's new "MAI Thinking 1" model comes with a 109-page tech report that looks REALLY detailed, this is amazing

24

984

121

679

196K

Jack @JackNotOld

10 days ago

@Andres_Nava_12 @MatthieuWyart Awesome!

0

189

Jack @JackNotOld

14 days ago

@corefpark Yeah, for sure. I'm curious whether Atlantis lands in a coherent off-manifold structure after distance fine-tunes or just scatters. Let me know if you run it.

0

1

0

20

JackNotOld retweeted

Ali Hatamizadeh

@ahatamiz1

15 days ago

Gated DeltaNet-2 is here. 🚀

🔥 New paper: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Gated DeltaNet-2 outperforms KDA and Mamba-3, the latest and best recurrent architectures, head to head at 1.3B. 🏆

💡 Here's the idea behind it:

Linear attention squeezes an unbounded KV cache into a fixed-size recurrent state. The hard part isn't just what to forget, it's how to edit that memory without scrambling the associations already in it.

Prior delta-rule models like Gated DeltaNet and KDA use one scalar gate to do two jobs at once: erasing old content and writing new content. But these two decisions act on different axes of the state, so tying them together is a real limitation.

Gated DeltaNet-2 decouples them.

✂️ a channel-wise erase gate b_t picks which key-side coordinates to read and remove
✍️ a channel-wise write gate w_t picks which value-side coordinates to commit
🔁 recovers KDA when both gates collapse to a scalar, and Gated DeltaNet when the decay collapses too
⚡ still trains fast: chunkwise WY algorithm with gate-aware backward, fused in Triton

📊 Results:

We train 1.3B models on 100B tokens of FineWeb-Edu, matched in recurrent state size, against Mamba-2, Gated DeltaNet, KDA, and Mamba-3.

Best average on language modeling + commonsense reasoning, in both recurrent and hybrid settings
Biggest gains on long-context RULER retrieval. S-NIAH-3 jumps from 63 to 90 over KDA, and multi-key needle retrieval climbs from 28 to 38

Joint work with @YejinChoinka and @jankautz.

📄 Paper: https://t.co/Zw6yXbHjGU
💻 Code: https://t.co/s8IWwaRU18

#LinearAttention #StateSpaceModels #Mamba #LLM

25

645

98

431

193K
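
A rough sketch of the erase/write split described in the Gated DeltaNet-2 post above, for readers who want something concrete. This is not the paper's equation: the update form, the function names, and the omission of the decay term are all assumptions made for illustration. It only shows how a channel-wise erase gate on the key side and a channel-wise write gate on the value side could replace a single scalar gate, and that the scalar delta rule falls out when both gates collapse to the same scalar.

```python
# Illustrative only: a toy recurrent state S (rows = value dims, cols = key dims).
# The decoupled update below is a hypothetical form, not taken from the paper.

def matvec(S, x):
    """Read from the state: returns S @ x."""
    return [sum(s * xi for s, xi in zip(row, x)) for row in S]


def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def scalar_delta_step(S, k, v, beta):
    """Classic delta rule with one scalar gate: S <- S - beta * (S k - v) k^T."""
    pred = matvec(S, k)
    err = [p - t for p, t in zip(pred, v)]
    upd = outer(err, k)
    return [[s - beta * u for s, u in zip(rs, ru)] for rs, ru in zip(S, upd)]


def decoupled_step(S, k, v, erase_gate, write_gate):
    """Assumed decoupled form: S <- S - (S k)(b * k)^T + (w * v) k^T.

    erase_gate (b) has one entry per key coordinate,
    write_gate (w) has one entry per value coordinate.
    """
    pred = matvec(S, k)
    erase = outer(pred, [b * ki for b, ki in zip(erase_gate, k)])
    write = outer([w * vi for w, vi in zip(write_gate, v)], k)
    return [
        [s - e + w for s, e, w in zip(rs, re, rw)]
        for rs, re, rw in zip(S, erase, write)
    ]
```

Setting every entry of both gates to the same value beta makes `decoupled_step` agree with `scalar_delta_step`, which mirrors the "recovers ... when both gates collapse to a scalar" remark in the post.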

Jack @JackNotOld

15 days ago

@eliebakouch Crazy timing! https://t.co/FjFuPDJiru Just published a paper about an SAE feature in Qwen's recurrent writes that behaves like an erase operation. Feels like these architectures are converging toward increasingly interpretable state dynamics.

Jack @JackNotOld

15 days ago

New paper! Trained an SAE on Qwen's recurrent state writes.

Found an "erase" feature. Substituting it for the model's "write" drops the target token from next-token logits. The shift factors through forget, read, output at R²=0.98 with no fitted params.

https://t.co/XrlF3Ekx9G

2

4

0

3

894

0

1

0

1

666

Jack @JackNotOld

15 days ago

Interesting, though gradient misalignment alone doesn’t necessarily mean a separate manifold... The distance task may induce a uniquely structured gradient direction (making it look orthogonal to other tasks) even with the same underlying geometry. Would be really interesting to see whether those gradients project onto the same principal directions as the other tasks.

1

0

36

Jack @JackNotOld

15 days ago

HF: https://t.co/oh8DRVK8RU
Code: https://t.co/cSXpZMo2AR

0

2

0

119

JackNotOld retweeted

Jediah Katz

@jediahkatz

17 days ago

i would never hire anyone with a 4 year resume gap

226

8K

154

668

2M

JackNotOld retweeted

Citrini

@citrini

22 days ago

Morgan Stanley’s price discovery happens on @tradexyz

38

2K

127

265

219K

Jack @JackNotOld

25 days ago

“Directionally very interactive”

Thinking Machines

@thinkymachines

25 days ago

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. https://t.co/AFJZ5kH7Ku

462

16K

2K

12K

8M

0

2

0

131

Jack @JackNotOld

about 2 months ago

@natolambert @scaling01 Congrats!

0

179

Jack @JackNotOld

about 2 months ago

@edwardjhu @OpenAI @Yoshua_Bengio @Mila_Quebec Congratulations!

0

1

0

375

Jack @JackNotOld

about 2 months ago

Trained states and dataset now on @huggingface Hub 🤗

Hybrid models (Qwen3.5, FalconH1) initialize 75% of their parameters to zero. We trained those initial states on 45 verified solutions: +23.6pp on HumanEval, +10.8pp over LoRA, zero inference overhead.

Try S₀ tuning on Qwen3.5-4B without training:
https://t.co/EyLKeEkveI

Training data (45 verified HumanEval solutions):
https://t.co/gyXwOnyq74

Github: https://t.co/A4KwMeGdYA
Paper: https://t.co/fsakFtE5Vj

0

2

0

218

JackNotOld retweeted

martin_casado

@martin_casado

about 2 months ago

Mythos appears to be the first class of models trained at scale on Blackwells. Then will be Vera Rubins. Pre-training isn't saturated. RL works. And there is *so much* computing coming online soon. Buckle your chin straps. It's going to be fucking wild.

106

4K

306

626

453K

Jack @JackNotOld

2 months ago

@samsheffer @GoogleDeepMind @OfficialLoganK @GoogleAIStudio Congrats!

1

0

192

JackNotOld retweeted

Jack @JackNotOld

2 months ago

Code: https://t.co/MqrEhJs4hR
Paper: https://t.co/KeFrHyzBhQ
pip install s0-tuning

This suggests a different axis of adaptation: state, not weights. https://t.co/QKVEkxl3wH

0

1

0

198
