logprob

@logprob

phd student, deep learning

Joined January 2019

3.1K Following

138 Followers

479 Posts

Pinned Tweet

logprob @logprob

about 2 years ago

activation functions drama from: https://t.co/Ej0Yjb8i8u

logprob retweeted

Yijing Zhang @YijingZ91217

3 days ago

Neural PDE solvers have seen exciting progress! 🌊 But despite growing adoption, we still don’t know 𝘄𝗵𝗲𝗻 we should use them instead of classical solvers. 🤔 Our new paper has a surprising finding: the harder the PDE task, the more cost-effective learned solvers become. 🧵👇

Photo: https://t.co/nUXv1FXZYU

28K

logprob retweeted

Silvia Sapora @silviasapora

1 day ago

After interviewing for Research Scientist roles at DeepMind, Isomorphic, Meta, Cohere and more, I wrote up everything I learned. Technical prep, logistics, negotiation, and emotional breakdowns. Check out my guide: https://t.co/eLh20ggMHW

257

194K

logprob retweeted

Flavio Martinelli @FlaviohMar

1 day ago

1/ NEW PAPER. Why do larger networks train better? "Because they contain more candidate subnetworks that can learn the task" → lottery tickets This explanation uses an appealing but misleading metaphor🧵 We propose an intuitive alternative grounded in theory: escape dimensions

Photo: https://t.co/38XyKFsEtp

833

750

46K

logprob retweeted

Alex Naka @gottapatchemall

2 days ago

Few-step Cofolding with All-Atom Flow Maps https://t.co/R0KwqupBTp

logprob retweeted

Julien Blanchon 🇺🇦

@JulienBlanchon

3 days ago

I just spent 3 days on something that has been analytically proven impossible 😅

logprob retweeted

Phillip Isola @phillip_isola

4 days ago

What should be remembered: a compressed representation of the past that predicts the future (predictive state). How to update memory: predict the next predictive state.

Photo: https://t.co/mF42GDKTgL

logprob retweeted

Akarsh Kumar

@akarshkumar0101

4 days ago

We never really knew how to train nonlinear RNNs well… BPTT struggled with vanishing grads (no long-range memory) and sequential rollout (hard to parallelize). What if instead an oracle told us the optimal memory state m_t at each step? Then the RNN could do one-step supervised learning on (m_t, x_{t+1}) → m_{t+1} labels. We call this Supervised Memory Training (SMT): a replacement for BPTT that trains RNNs without unrolling them. SMT is time-parallelizable and solves vanishing gradients. Website: https://t.co/BvctWJlPad arXiv: https://t.co/5xR0mUVymp

779

118

660

170K

logprob retweeted

Yifei Wang

@WangYw251

5 days ago

This might explain why REPA works: aligning DiT representations to DINOv2 may just be helping the model form these morphogen-like spatial gradients earlier and cleaner. One injects good representations, the other reveals them — same coin, two sides. Going further — what's the principle behind RAE?🤔

172

164

22K

logprob retweeted

Michael Tschannen @mtschannen

8 days ago

For the past years my research focus was on unifying models and training paradigms across modalities. Today I'm excited that we're releasing our latest model aligned with this theme: Gemma 4 12B, a dense encoder-free model which processes raw text, image, and audio inputs! 1/

Photo: https://t.co/4J2JKCtzU5

129

537

108K

logprob retweeted

Sasha Rush

@srush_nlp

8 days ago

On-Policy Distillation is the most active new research direction being explored in RL for LLMs. Had the chance to discuss how it works with Dwarkesh and why it fits so nicely into large-scale pipelines.

127

138K

logprob retweeted

Fei-Fei Li

@drfeifei

8 days ago

https://t.co/Kt50ttQRMJ

160

941

972K

logprob retweeted

Miles Cranmer

@MilesCranmer

8 days ago

For the past two years we've studied a decades-old problem in fluid dynamics: why do some turbulent systems grow 3x faster in the real world than simulations predict? With some tabletop fluids experiments and a physics foundation model, we finally have some results!!! 👇

211

128

16K

logprob retweeted

Niklas Rindtorff

@Niklas_TR

9 days ago

Introducing Strong Stochastic Flow Maps TLDR: Stochastic Flow Maps where we learn the stochastic solution path. Work led by Sam McCallum, @zwblasingame, with Timothy Herschelll, @AlexanderTong7, and @JamesFosterBath Arxiv: https://t.co/Hy8WWZOnjE Code: https://t.co/PMe6RoqyZA

362

305

73K

logprob retweeted

Yifei Zuo

@YifeiZuoX

13 days ago

For me, the coolest finding is that Muon optimizer is crucial for Parallax to move beyond Softmax Attention. Lesson — don't evaluate new architectures solely under AdamW, you'll miss the good ones. paper: https://t.co/yAqClXrJUz code: https://t.co/D4pgIr1wM7 For the origin of Parallax, check out the LLA paper at ICLR 2026: paper: https://t.co/85OzoOQlnF code: https://t.co/eqMYZ0U6qO

354

275

77K

logprob retweeted

Guanyu Zhou

@TMartyr4951

12 days ago

Everyone focused on JiT's move to pixel space. Today, JLT asks a different question: Can the benefits of clean prediction survive entirely within latent space? FID: 6.56 → 2.56✅ JLT learns to predict latent x directly, rather than velocity target v. Check our JLT:🧵

Photo: https://t.co/vzaZJpDSh6

116

102

12K

logprob retweeted

Chinmay

@ChinmayKak

11 days ago

long overdue website overhaul. check it out, link in the comments:)

395

362

22K

logprob retweeted

alphaXiv

@askalphaxiv

12 days ago

"Learn from your own latents, not tokens: A Sample Complexity Theory" This paper explains why data2vec and JEPA can learn with much less data. They showed that when data has hidden hierarchy, token prediction becomes harder as the hierarchy gets deeper. But latent prediction keeps the learning problem simple at every level. Which suggests that models may learn faster when they stop predicting raw tokens and start predicting their own abstractions.

632

105

519

35K

logprob retweeted

Nicholas Boffi

@nmboffi

13 days ago

really excited to finally release this one. guidance is critical for getting flow and diffusion models to do what we want, but most methods in the literature are heuristic and work for unclear reasons. the field likes to frame it as reward-tilted sampling, yet what people run in practice is often nowhere close to that. here we take a different angle, deriving guidance from first principles as an optimal control problem. existing methods drop out as coarse approximations, and the flow map emerges as the fundamental ingredient for effective guidance. our approach is training-free, and reaches state-of-the-art performance across diverse benchmarks at up to 70x fewer NFEs. amazing work by @jrrhuang, justin, kartik, and sheel. stay tuned for more on the finetuning side!

151

117

15K

logprob retweeted

Michael Choi @michaelchchoi

15 days ago

michaelchchoi's tweet photo. https://t.co/IzJDfPQd1S

131

76K

logprob retweeted

hardmaru

@hardmaru

16 days ago

For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But holding the entire network in memory all at once is why AI training is hitting a resource wall. We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal. This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (https://t.co/PK5h0mqQSo), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.

154

637

741K
