Paul Janson @janson002 - Twitter Profile

Pinned Tweet

about 2 months ago

PyLO is accepted to MLSys 2026! 🎉🚀 A PyTorch-native library bringing SOTA learned optimizers to the codebases most of us actually use — with fast CUDA kernels and real speedups on large-scale training. Drop-in ready, no more JAX-only barriers. Library: https://t.co/NTjBF64jD3

Paul Janson @janson002

about 1 year ago

Have you ever trained a neural network using a learned optimizer instead of AdamW? Doubt it: you're probably coding in PyTorch! Excited to introduce PyLO: Towards Accessible Learned Optimizers in PyTorch! Accepted at @icmlconf ICML 2025 CODEML workshop 🧵1/N

janson002 retweeted

Stefan Horoi

@stefanhoroi

about 14 hours ago

🎉 Our paper "𝗙𝗿𝗼𝗺 𝗠𝗲𝗺𝗼𝗿𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝘁𝗼 𝗣𝗮𝗿𝗮𝗺𝗲𝘁𝗲𝗿 𝗜𝗻𝘁𝗲𝗿𝗳𝗲𝗿𝗲𝗻𝗰𝗲: 𝗛𝗼𝘄 𝗢𝘃𝗲𝗿𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝗘𝘅𝗽𝗲𝗿𝘁𝘀 𝗛𝗮𝗿𝗺𝘀 𝗠𝗼𝗱𝗲𝗹 𝗠𝗲𝗿𝗴𝗶𝗻𝗴" was accepted at ICML 2026! 🔎 Do better expert models always lead to better merged models? Not necessarily! 📜Read the paper: https://t.co/aJN7Oi8Dw2 🧵 1/9

janson002 retweeted

VAIBHAV SINGH

@VAIBHAV22155287

10 days ago

Training big models gets painful once a full replica won't fit on one accelerator. You end up with model-parallel methods or techniques like FSDP that are communication-heavy and limited in how far they parallelize. We tried a new axis that lets you split the model the way model parallelism does, but communicate gradients instead of activations. 🧵 1/N

janson002 retweeted

Dane Malenfant

@dvnxmvl_hdf5

22 days ago

🚨Excited to announce our workshop Context Beyond the Window hosted at COLM in SF! 🚨

LLMs have finite context windows, yet real-world tasks demand absorbing, retaining, and acting on information that far exceeds any single prompt.

1/3

We're looking for submissions across:

https://t.co/6y1ILeeC9A

• Context compression 🧃 — token compaction, recursive subagent calls, and external memory for storing and retrieving information
• Efficient architectures 🚀 — sub-quadratic attention variants that make extremely long context computationally feasible
• Continual training 🌱 — test-time training on streaming data, context distillation, and knowledge accumulation through continued pre-training
• Agentic memory systems 🐘 — scaffolds and test-time scaling techniques that improve knowledge retention and acquisition in LLMs
• Evaluation 🎯 — benchmarking models on increasingly long-horizon tasks

janson002 retweeted

Benjamin Thérien @ MLSys 2026

@benjamintherien

30 days ago

I’ll be at #MLSys2026 this week to present “PyLO: Towards Accessible Learned Optimizers in PyTorch”! Come listen to @janson002’s presentation today from 3:45PM – 4:00PM in room 2 or join us at poster 29. If you work on similar topics or just want to chat — DM me. 1/2

Paul Janson @janson002

30 days ago

Come visit our poster at Evergreen Ballroom at 6.30pm and oral presentation at Grand Ballroom 2 at 3.45pm. Paper: https://t.co/RAeCGmIVip

Paul Janson @janson002

30 days ago

Presenting today at @MLSysConf 2026: PyLO🚀 Learned optimizers (like VeLO) have been stuck in JAX. PyLO brings them to PyTorch via the standard torch.optim interface, with CUDA kernels and HF Hub weight loading. #MLSys2026 https://t.co/NTjBF64jD3
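
For readers new to the idea: a learned optimizer replaces a hand-designed update rule such as AdamW with a small network that maps per-parameter features to an update. The sketch below is dependency-free Python and is not PyLO's API. The class name, the feature choice (gradient and momentum), the direction-times-exponential-magnitude output, and the hand-set weights are all assumptions made for illustration; in a real learned optimizer the weights would come from meta-training.

```python
import math


class TinyLearnedOptimizer:
    """Toy learned optimizer (hypothetical): a small MLP maps per-parameter
    features [gradient, momentum] to an update."""

    def __init__(self, params, w_hidden, w_out, beta=0.9, step_mult=0.1, exp_mult=0.1):
        self.params = params        # list of floats, updated in place
        self.w_hidden = w_hidden    # (hidden x 2) weights over [grad, momentum]
        self.w_out = w_out          # (2 x hidden) weights giving [direction, magnitude]
        self.beta = beta
        self.step_mult = step_mult
        self.exp_mult = exp_mult
        self.momentum = [0.0] * len(params)

    def _mlp(self, features):
        # One tanh hidden layer followed by a linear output layer with two rows.
        hidden = [math.tanh(sum(w * f for w, f in zip(row, features)))
                  for row in self.w_hidden]
        direction, magnitude = (sum(w * h for w, h in zip(row, hidden))
                                for row in self.w_out)
        return direction, magnitude

    def step(self, grads):
        # Mirrors the step() calling pattern of a conventional optimizer loop.
        for i, g in enumerate(grads):
            self.momentum[i] = self.beta * self.momentum[i] + (1 - self.beta) * g
            direction, magnitude = self._mlp([g, self.momentum[i]])
            self.params[i] -= self.step_mult * direction * math.exp(self.exp_mult * magnitude)


# Hand-set stand-in weights (assumption): these would normally be meta-learned.
W_HIDDEN = [[1.0, 0.0], [0.0, 1.0]]
W_OUT = [[0.5, 0.5], [0.0, 0.0]]


def loss(p):
    return 0.5 * sum(x * x for x in p)


params = [2.0, -1.5]
opt = TinyLearnedOptimizer(params, W_HIDDEN, W_OUT)
initial_loss = loss(params)
for _ in range(200):
    grads = list(params)  # the gradient of 0.5 * x^2 is x
    opt.step(grads)
final_loss = loss(params)
```

The training loop only ever calls `step`, which is why an optimizer of this kind can in principle sit behind the same interface as a hand-designed one.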

janson002 retweeted

Volkan Cevher

@CevherLIONS

about 1 month ago

This is like a good stress test for optimizers. Kaon is basically Muon/lmo + spectral noise. It preserves the singular vectors of the gradient and randomizes only the positive singular weights. For exchangeable noise, the conditional expectation is the spectral-norm-ball lmo direction up to scale. Individual draws are not necessarily lmos tho. Freon’s map for c>1/2 is decreasing on the singular values, so the operator is non-monotone. Exact fixed-step Freon can fail even on a simple convex quadratic minimization near rank deficiency. Freon’s map for c<=1/2 (i.e., the monotone case) can also be analyzed using phi-convexity. Shameless plug: https://t.co/nm16wKDz9L
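
The conditional-expectation claim in the post above can be written out under a few assumptions that the post does not state: the gradient $G$ has a compact SVD of rank $r$, the lmo is taken over the unit spectral-norm ball with the minimizing sign convention, and the positive random weights $w_i$ are exchangeable, integrable, and drawn independently of $G$. The symbol $D$ for a single randomized draw is introduced here only for illustration.

```latex
\begin{aligned}
G &= U \operatorname{diag}(\sigma_1,\dots,\sigma_r)\, V^\top, \qquad \sigma_i > 0,\\
-\,U V^\top &\in \arg\min_{\|X\|_{2} \le 1} \langle G, X \rangle
  \qquad \text{(the lmo direction, with minimum value } -\textstyle\sum_i \sigma_i\text{)},\\
D &= -\,U \operatorname{diag}(w_1,\dots,w_r)\, V^\top, \qquad w_i > 0 \text{ exchangeable},\\
\mathbb{E}[D \mid G] &= -\,U \operatorname{diag}(\mathbb{E} w_1,\dots,\mathbb{E} w_r)\, V^\top
  = -\,m\, U V^\top, \qquad m = \mathbb{E} w_1 = \dots = \mathbb{E} w_r .
\end{aligned}
```

Exchangeability gives every $w_i$ the same mean $m$, so the average draw is the lmo direction scaled by $m$. A single draw with unequal $w_i$ does not attain the minimum over any spectral-norm ball, which is consistent with the remark that individual draws are not necessarily lmos.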

janson002 retweeted

Tony S.F. @tonysilveti

about 1 month ago

New paper! We analyze proximal preconditioned gradient methods that extend Muon/Scion to handle nonconvex constraints (Stiefel manifold, spectral sphere, norm balls, ...) with convergence guarantees under heavy-tailed noise + variance reduction w/ STORM! https://t.co/px67BeR2fh

janson002 retweeted

Keller Jordan

@kellerjordan0

about 1 month ago

Modded-NanoGPT optimization result #13: @benjamintherien has achieved a new record of 3210 steps (-15), by wrapping NorMuonH in a MuLoCo-style outer Nesterov SGD. Compared to the target loss, this result has a p-value of p=1.3e-4. Compared to result #11, it has p=0.099.
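
As a rough illustration of what wrapping an inner optimizer in an outer Nesterov SGD can look like, here is a sketch that assumes "MuLoCo-style" means a DiLoCo-like scheme: run several inner steps, treat the resulting parameter change as a pseudo-gradient, and apply Nesterov momentum to it. This is not the record-setting code. The inner optimizer is plain gradient descent standing in for NorMuonH, there is a single worker, and the function names, toy quadratic, and hyperparameters are all hypothetical.

```python
def inner_steps(theta, grad_fn, lr, num_steps):
    """Stand-in inner optimizer: plain gradient descent (not NorMuonH)."""
    theta = list(theta)
    for _ in range(num_steps):
        grads = grad_fn(theta)
        theta = [t - lr * g for t, g in zip(theta, grads)]
    return theta


def outer_nesterov_step(theta, theta_inner, buf, outer_lr, mu):
    """One outer update: Nesterov-momentum SGD applied to the pseudo-gradient."""
    # Pseudo-gradient: how far the inner optimizer moved the parameters.
    pseudo_grad = [t - ti for t, ti in zip(theta, theta_inner)]
    # Momentum buffer update, then the Nesterov look-ahead combination.
    buf = [mu * b + d for b, d in zip(buf, pseudo_grad)]
    update = [d + mu * b for d, b in zip(pseudo_grad, buf)]
    theta = [t - outer_lr * u for t, u in zip(theta, update)]
    return theta, buf


# Toy quadratic objective (assumption) with two different curvatures.
CURVATURE = [1.0, 10.0]


def loss(theta):
    return 0.5 * sum(c * t * t for c, t in zip(CURVATURE, theta))


def grad(theta):
    return [c * t for c, t in zip(CURVATURE, theta)]


theta = [1.0, 1.0]
buf = [0.0, 0.0]
initial_loss = loss(theta)
for _ in range(20):
    theta_inner = inner_steps(theta, grad, lr=0.05, num_steps=5)
    theta, buf = outer_nesterov_step(theta, theta_inner, buf, outer_lr=0.7, mu=0.5)
final_loss = loss(theta)
```

With `mu=0.0` and `outer_lr=1.0` the outer step simply adopts the inner result, so the outer momentum is the only thing this wrapper adds.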

Paul Janson @janson002

about 2 months ago

@WorldEdServices Applicants paying for credential evaluations deserve a two-way communication channel. Contact forms with no-reply responses aren't workable when documents go missing. DMing reference number and asking for a named case owner. Hoping for real help.

janson002 retweeted

Benjamin Thérien @ MLSys 2026

@benjamintherien

about 2 months ago

I’ll be at #ICLR2026 🇧🇷 this week to present “μLO: Compute-Efficient Meta-Generalization of Learned Optimizers” and give a talk about SparseLoCo at the Protocol Learning Workshop! If you work on these topics or just want to chat — DM me. 🧵1/3

janson002 retweeted

Abhinav Moudgil

@amoudgl

about 2 months ago

Heading to Rio 🇧🇷 to present our Celo line of work at #ICLR2026! Get in touch if you are curious about new avenues in neural network training or how we scaled learned optimizers from CIFAR-10 to GPT-3 🚀 Details ⬇️

Paul Janson @janson002

about 2 months ago

Full Paper: https://t.co/RAeCGmIVip #PyTorch #MLSys2026 #LearnedOptimizers #Jax #CUDA #Kernels #MachineLearning #MLSystems

janson002 retweeted

Michael Rizvi-Martel @frisbeemortel

2 months ago

Latent CoT is an alternative LLM reasoning scheme hypothesized to enable “superposition” allowing models to hold uncertainty over multiple concepts during reasoning 💭 We revisit superposition in 3 latent CoT approaches and find that it is largely an illusion 🔮! More in 🧵

janson002 retweeted

Abhinav Moudgil

@amoudgl

3 months ago

Introducing Celo2: Towards Learned Optimization Free Lunch We show that learned optimizers can generalize to practical tasks like GPT-3 1.3B pretraining and several out-of-distribution vision/RL tasks from limited meta-training (~4.5 GPU hours)! 🧵

janson002 retweeted

templar @tplr_ai

3 months ago

We just completed the largest decentralised LLM pre-training run in history: Covenant-72B. Permissionless, on Bittensor subnet 3. 72B parameters. ~1.1T tokens. Commodity internet. No centralized cluster. No whitelist. Anyone with GPUs could join or leave freely. 1/n

janson002 retweeted

Benjamin Thérien @ MLSys 2026

@benjamintherien

3 months ago

🚨 New Tech Report: Covenant-72B 🚨 TL;DR we use SparseLoCo to pre-train a 72B model on 1.1T tokens over the internet! This is the largest decentralized training run to date. https://t.co/vucL1Asr6w

janson002 retweeted

VAIBHAV SINGH

@VAIBHAV22155287

4 months ago

Masked Diffusion LMs (MDLMs) are the most exciting paradigm shift in AR generation because they can decode in parallel, infill, and self-correct. But they are bottlenecked by the transformer's quadratic attention, making throughput fall apart for long contexts. We offer a simple solution. Introducing DiffuMamba: first diffusion LM with a bidirectional Mamba backbone. Better quality. Up to 8.2x faster. 🧵1/N
