Shengjie Luo @Roger98079446 - Twitter Profile

Pinned Tweet

over 3 years ago

#ICLR2023 New paper! "Rethinking the expressive power of GNNs via graph biconnectivity" accepted as an 𝗼𝗿𝗮𝗹 𝗽𝗿𝗲𝘀𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 (notable-top 5%) 🔥A new direction to study GNN expressivity via graph biconnectivity! 👇Let's see the details of our fruitful results🤗

Bohang Zhang @ICLR 2024 @bohang_zhang

over 3 years ago

Excited to see our paper "Rethinking the expressive power of GNNs via graph biconnectivity" accepted as an 𝗼𝗿𝗮𝗹 𝗽𝗿𝗲𝘀𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 (notable-top 5%) at #ICLR2023! https://t.co/jC6R1lCAPL Joint work with @Roger98079446, Liwei Wang, and Di He 1/n

Roger98079446 retweeted

Tianle Cai

@tianle_cai

2 months ago

https://t.co/CivOb4riiJ

Roger98079446 retweeted

Tianle Cai

@tianle_cai

3 months ago

Can we turn part of an LLM's weights into long-term memory that continuously absorbs new knowledge? We took a small step toward this with In-Place Test-Time Training (In-Place TTT) — accepted as an Oral at ICLR 2026 🎉 The key idea: no new modules, optional pretraining. We repurpose the final projection matrix in every MLP block as fast weights. With an NTP-aligned objective and efficient chunk-wise updates, the model adapts on the fly — complementing attention rather than replacing it. 📄 Paper: https://t.co/mtfkbptevk with amazing @Guhao_Feng @Roger98079446 Kai @GeZhang86038849 Di @HuangRubio

Roger98079446 retweeted

Yiping Lu

@2prime_PKU

3 months ago

Gradient-Lipschitz analysis can recover the scaling behind muP! Studying how network width changes the gradient Lipschitz constant under operator norms, we
• recover muP scaling for Adam
• show that Muon’s smoothness can be bad
• find that a new row-wise gradient normalization is competitive with Muon

Roger98079446 retweeted

Karan Dalal

@karansdalal

6 months ago

Our new paper, “End-to-End Test-Time Training for Long Context,” is a step towards continual learning in language models.

We introduce a new method that blurs the boundary between training and inference. At test-time, our model continues learning from given context using the same next-token prediction objective as training.

With this end-to-end objective, our model can efficiently compress substantial context into its weights and still use it effectively, unlocking extremely long context windows for complex reasoning and applications in agents and robotics.

Paper: https://t.co/tqPYECjFpn
Code: https://t.co/tADD7wYDAL

Roger98079446 retweeted

Ilya Sutskever

@ilyasut

7 months ago

One point I made that didn’t come across:
- Scaling the current thing will keep leading to improvements. In particular, it won’t stall.
- But something important will continue to be missing.

Roger98079446 retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

about 1 year ago

Diagnosing and Improving Diffusion Models by Estimating the Optimal Loss Value

"We first derive the optimal loss in closed form under a unified formulation of diffusion models, and develop effective estimators for it, including a stochastic variant scalable to large datasets with proper control of variance and bias. With this tool, we unlock the inherent metric for diagnosing the training quality of mainstream diffusion model variants, and develop a more performant training schedule based on the optimal loss. Moreover, using models with 120M to 1.5B parameters, we find that the power law is better demonstrated after subtracting the optimal loss from the actual training loss, suggesting a more principled setting for investigating the scaling law for diffusion models."

Roger98079446 retweeted

Songlin Yang

@SonglinYang4

over 1 year ago

Introducing the first open-source implementation of native sparse attention: https://t.co/OeUp1bH6h9. Give it a spin and cook your NSA model! 🐳🐳🐳

Roger98079446 retweeted

Tianle Cai

@tianle_cai

over 1 year ago

Just grasped the true significance (not just bc it's submitted by Wenfeng) of this work after reading @SonglinYang4's explanation. The breakthrough isn't hybrid attention (studied years ago), but the ingenious kernel that delivers real-world speedups for dynamic sparse attention.

As someone who worked on efficient transformers in undergrad, I had the impression that combining "efficient attentions" (linear, sparse, conv, block-structured), which theoretically would be faster, had the potential to replace full attention but was practically slower.

But DeepSeek's solution is different: By having each query group of a token attend to the same KV block, they can really reduce the memory movement and achieve FlashAttention-like memory efficiency.

This matters enormously for reasoning models that output long thinking processes (10k+ tokens). The efficient dynamic sparse kernel dramatically speeds up both training and inference for such models. What a brilliant example of algorithm-system co-design!

Roger98079446 retweeted

Jiao Sun

@sunjiao123sun_

over 1 year ago

Mitigating racial bias from LLMs is a lot easier than removing it from humans! Can’t believe this happened at the best AI conference @NeurIPSConf We have ethical reviews for authors, but missed it for invited speakers? 😡

Roger98079446 retweeted

Noam Shazeer

@NoamShazeer

about 2 years ago

Character AI is serving 20,000 QPS. Here are the technologies we use to serve hyper-efficiently. https://t.co/R14Jt9Z5yo

Roger98079446 retweeted

Yuanqi Du

@YuanqiD

about 2 years ago

🧵1/7 Introducing an “Encyclopedia” of Molecular Design with Machine Learning @NatMachIntell! https://t.co/npDSVehRBT Collaboration with @arian_jamasb*, @JeffGuo__*, @TianfanFu, @charlieharris01, @yingheng_wang, @chenru_duan, @pl219_Cambridge, @pschwller, and Tom Blundell.

Roger98079446 retweeted

Shubhendu Trivedi @_onionesque

about 2 years ago

Apropos of some real life discussions: We have superfast custom CUDA implementations for tensor-product-based (Clebsch-Gordan) equivariant NNs: https://t.co/GItpvKPfPi Based on the papers (and heavily optimized further!) https://t.co/I57bE5hIr7 and https://t.co/M7EKAq3qK8

Roger98079446 retweeted

Kacper Kapuśniak @KKapusniak1

about 2 years ago

If data lives on a manifold, how do we design meaningful interpolations between marginals? We present Metric Flow Matching (MFM)… @PPotaptchik @TeoReu @leoeleoleo1 @AlexanderTong7 @mmbronstein @bose_joey @Francesco_dgv 🔗Dive in here: https://t.co/XwaU2nyaMT 🧵 (1/12)

Roger98079446 retweeted

Xiang Fu

@xiangfu_ml

about 2 years ago

Charge density is the core attribute of atomic systems in DFT. When representing and predicting charge density with ML, it is challenging to balance accuracy and efficiency. We propose a recipe that achieves SOTA on both: https://t.co/mxKQczuKzF 1/5

Shengjie Luo @Roger98079446

about 2 years ago

Super Cool! Don't miss it!

Bohang Zhang @ICLR 2024 @bohang_zhang

about 2 years ago

#ICLR2024 Just arrived in Vienna! Don't miss our oral presentation tomorrow afternoon in room Halle A3, focusing on 𝗚𝗡𝗡𝘀 and their 𝗲𝘅𝗽𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗼𝘄𝗲𝗿! Also, swing by our poster session (Poster272, Halle B). See you there! 🌟

Shengjie Luo @Roger98079446

about 2 years ago

Experiment: Extensive experiments are conducted to verify the efficiency and generality of our approach. See our paper and code repository for more details! Paper: https://t.co/UOPRKA2lCd Code: https://t.co/7Sn4jg7ZRh Looking forward to your feedback! 10/10

Shengjie Luo @Roger98079446

about 2 years ago

#ICLR2024 Arrived in Vienna! Happy to share our recent work 𝘁𝗼𝘄𝗮𝗿𝗱𝘀 𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁 𝗮𝗻𝗱 𝗲𝗳𝗳𝗲𝗰𝘁𝗶𝘃𝗲 𝗴𝗲𝗼𝗺𝗲𝘁𝗿𝗶𝗰 𝗱𝗲𝗲𝗽 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗳𝗼𝗿 𝘀𝗰𝗶𝗲𝗻𝗰𝗲! With incredible CTL and @ask1729! May 9 10:45am-12:45pm (Poster254, Halle B). Details⬇️ (1/n)

Shengjie Luo @Roger98079446

about 2 years ago

Our Method4⃣: As a fundamental operation, our Gaunt Tensor Product can be applied to major operation classes that are widely used in E(3) equivariant networks. A comprehensive analysis is provided in our work: 9/n
