Jacob Helwig @JacobHelwig - Twitter Profile

Jacob Helwig @JacobHelwig

about 9 hours ago

@kalomaze @DomaOrut Were the predictions for the final hiddens not as good? Would be surprised if that were the case

0

16

Jacob Helwig @JacobHelwig

about 10 hours ago

@yoavgo @TacoCohen Then sample a bunch of latents and marginalize over them via the Bayes estimator for a given risk. LLM evals use 0-1 risk, in which case the Bayes estimator is the posterior mode. This is exactly the self-consistency=majority@N strategy

0

1

0

45
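A minimal sketch of the point in the reply above, assuming the latents have already been sampled and each one reduced to a final answer string. Under 0-1 risk the expected loss of predicting answer a is 1 - P(a), so the minimizer is the most probable answer, which with N samples is the majority vote. The function name and the sample answers are hypothetical, chosen only for illustration.

```python
from collections import Counter


def bayes_estimate_zero_one(samples):
    """Return the empirical posterior mode of the sampled answers.

    Under 0-1 loss, the empirical risk of predicting `a` is 1 - count(a)/N,
    so minimizing risk is the same as picking the most frequent answer.
    """
    counts = Counter(samples)
    n = len(samples)
    # Empirical 0-1 risk of each candidate answer.
    risk = {answer: 1.0 - c / n for answer, c in counts.items()}
    best = min(risk, key=risk.get)
    return best, risk


# Hypothetical final answers extracted from N = 7 sampled reasoning chains.
answers = ["42", "41", "42", "42", "17", "41", "42"]
best, risk = bayes_estimate_zero_one(answers)
print(best, risk[best])
```

Swapping in a different risk (for example squared error on numeric answers) would change the estimator to something other than the mode, which is why the 0-1 case lines up with majority@N specifically.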

JacobHelwig retweeted

Shubham Parashar @Shubham09632806

9 days ago

Excited to share that Learnability-Informed Fine-Tuning of Diffusion Language Models (LIFT) has been accepted at ICML 2026! 🎉 paper: https://t.co/R0Hhi0NYoh code: https://t.co/TUsMWRnm5T

1

20

5

12

2K

Jacob Helwig @JacobHelwig

10 days ago

@hi_tysam So you reduce exposure bias by adding random noise to model inputs. I wonder what would happen if noise more closely followed inference-time noise, eg by replacing input tokens with tokens sampled from model by forwarding model once before each train step

0

56

JacobHelwig retweeted

Xiuyu Li

@sheriyuo

11 days ago

LIFT is the SFT recipe for dLLMs that actually understands the masking dynamics. Vanilla SFT on dLLMs often HURTS performance, and they finally pin down why.

Their analysis: vanilla SFT overlooks learnability. Rare tokens are difficult to learn when most of the input is masked because the model has nothing to ground them in. Common tokens are easy and of little value to learn when most of the input is unmasked because the answer is essentially already given.

LIFT aligns training with the information available at different diffusion time steps. Learn easy tokens when most of the input is masked (build up basic vocabulary at the noisy end), and learn hard tokens when more context is available (let the model use that context). The schedule matches the difficulty of each token to the moment the model is best positioned to absorb it.

Learnability-Informed Fine-Tuning of Diffusion Language Models
Paper: https://t.co/zUodVpjVgb
Code: https://t.co/gLW9OrR4bO

1

54

10

38

4K

Jacob Helwig @JacobHelwig

25 days ago

@StefanGliga @kalomaze Although maybe not realistic, the MDLM indep assumption is explicit

0

2

0

72

Jacob Helwig @JacobHelwig

25 days ago

@kalomaze This paper https://t.co/Npqw53q4Sx guarantees parallel joint distribution sampling using spec decode-style verification/rejection. The idea (fig 9) is: in the same forward pass, decode multiple tokens AND vfy tokens from last fwd. Diffusion is in the title, but closer to MTP IMO

0

2

0

1

196

Jacob Helwig @JacobHelwig

28 days ago

@marikgoldstein torchCFM has good code: https://t.co/8esp3NVoyh

0

2

0

148

Jacob Helwig @JacobHelwig

about 1 month ago

@gabriberton @LucaAmb Isn’t the main point of GQA and MLA to reduce KV cache size?

1

2

0

36
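A rough back-of-the-envelope sketch of the KV-cache question above. The model shape (32 layers, 32 query heads, head dimension 128, fp16, 4096-token context) is a made-up configuration for illustration, not any particular model. MLA is not modeled here, since it caches a compressed latent rather than per-head keys and values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Size of the KV cache: one key and one value vector per layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical config: MHA keeps one KV head per query head, GQA shares
# each KV head across a group of 4 query heads.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(mha / 2**30, "GiB with MHA")
print(gqa / 2**30, "GiB with GQA")
```

The cache shrinks in proportion to the number of KV heads, while the query-side compute is unchanged in this accounting.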

Jacob Helwig @JacobHelwig

about 1 month ago

@novasarc01 @teortaxesTex Yeah, multi-teacher OPD is super cool and was also used by GLM-5 and MiMo-V2-Flash. Remarkably, some insanely-cracked dev already shipped MOPD in VeRL: https://t.co/3Sot3Z7RDo

0

2

0

177

Jacob Helwig @JacobHelwig

3 months ago

@tak3sh8 ODEs are continuous in time, and ResNets correspond to forward-Euler discretizations of the dynamics. W.r.t. @karpathy’s comment, SGD is the forward-Euler discretization of the gradient-flow ODE

0

6

0

150
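A small sketch of the correspondence in the reply above, on a toy quadratic loss chosen purely for illustration. It uses full-batch gradient descent, so the minibatch noise of SGD is left out; the only point is that the update theta - lr * grad(theta) is the forward-Euler step x + h * f(x) for the gradient-flow ODE dx/dt = -grad(x) with step size h = lr. A residual block x + f(x) has the same form with h = 1.

```python
def grad(theta):
    # Gradient of the toy loss L(theta) = 0.5 * theta**2 (an assumption for illustration).
    return theta


def gradient_descent(theta, lr, steps):
    # Plain full-batch gradient descent.
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


def forward_euler(f, x, h, steps):
    # Generic forward-Euler integrator for dx/dt = f(x).
    for _ in range(steps):
        x = x + h * f(x)
    return x


gd_result = gradient_descent(1.0, lr=0.1, steps=50)
euler_result = forward_euler(lambda x: -grad(x), 1.0, h=0.1, steps=50)
print(gd_result, euler_result)
```

Both loops apply the same update at every step, so they trace the same trajectory; shrinking the step size moves that trajectory toward the continuous-time solution of the gradient flow.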

Jacob Helwig @JacobHelwig

3 months ago

@eigenron I think TTRL did it better (more thorough experiments) https://t.co/mDCkRe6VbP

0

2

0

1

147

Jacob Helwig @JacobHelwig

3 months ago

@giffmana @francoisfleuret If in "pick up a good direction", "good" == "*locally* good", then I don't think your experiments contradict (A)

0

1

0

31

Jacob Helwig @JacobHelwig

3 months ago

@sasuke___420 Here's some good examples of torchrec:
- BERT: https://t.co/lkCoD7NloZ
- Decoder-only: https://t.co/pQXKHuIHgd

0

1

0

39

Jacob Helwig @JacobHelwig

3 months ago

@sasuke___420 As mentioned by another commenter, torch has sparse Adam, but nccl doesn't support sparse collectives, so I think it will only work with gloo backend. torchrec/fbgemm have some nice sparse optimizers, although they probably won't work with deepspeed/FSDP/Megatron

1

0

40

Jacob Helwig @JacobHelwig

3 months ago

@kalomaze @sasuke___420 Is it possible that people have reached the conclusion that WD on embeddings is bad due to the gradient sparsity issue? (ie, by applying WD on token embeddings that don't appear in the current batch)

0

1

0

34
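A toy sketch of the hypothesis in the question above, in pure Python and not tied to any optimizer library. The embedding table, learning rate, and decay coefficient are invented for illustration. Token 1 never appears in a batch: with weight decay applied to every row it shrinks steadily despite receiving no gradient, while restricting decay to rows present in the batch leaves it untouched.

```python
def sgd_step(emb, grads, lr, wd, decay_only_active):
    """One SGD step with weight decay on an embedding table.

    emb:   dict token_id -> embedding row (list of floats)
    grads: dict token_id -> gradient row, only for tokens in the current batch
    """
    new_emb = {}
    for tok, row in emb.items():
        active = tok in grads
        g = grads.get(tok, [0.0] * len(row))
        # Optionally skip decay for rows whose token is absent from the batch.
        decay = wd if (active or not decay_only_active) else 0.0
        new_emb[tok] = [w - lr * (gi + decay * w) for w, gi in zip(row, g)]
    return new_emb


def run(decay_only_active, steps=100):
    emb = {0: [1.0, -1.0], 1: [1.0, -1.0]}
    # Hypothetical batches: only token 0 ever appears (with zero gradient, for simplicity).
    batch_grads = {0: [0.0, 0.0]}
    for _ in range(steps):
        emb = sgd_step(emb, batch_grads, lr=0.1, wd=0.1,
                       decay_only_active=decay_only_active)
    return emb


dense = run(decay_only_active=False)
masked = run(decay_only_active=True)
print("decay on all rows:   ", dense[1])
print("decay on active rows:", masked[1])
```

Whether this effect actually explains reported results about weight decay on embeddings is the open question in the tweet; the sketch only shows the mechanism.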

Jacob Helwig @JacobHelwig

6 months ago

@nofreewill42 @tendies @TheVixhal @Yuchenj_UW @karpathy I think he’s aware, since vixhal is the one karpathy said it to

0

1

0

39

Jacob Helwig @JacobHelwig

12 months ago

(7/7) The two supersonic flow datasets we generated for evaluating ShockCast are available on HuggingFace: https://t.co/saZdK60wZd Paper: https://t.co/zOqH9urQjD Code: https://t.co/n1puP80rru

0

1

0

77

Jacob Helwig @JacobHelwig

12 months ago

We recently developed ShockCast, a deep learning framework for modeling high-speed flows using adaptive time-stepping (1/n)

1

7

2

0

452

Jacob Helwig @JacobHelwig

12 months ago

(6/n) We explore several physical priors to better align the neural CFL model with the classical CFL condition. We also introduce timestep conditioning strategies inspired by neural ODE and Mixture of Experts.

1

0

96
