Shauray @Shauray7 - Twitter Profile

Pinned Tweet

4 months ago

Twinflow distillation blog for qwen-image-2512 is finally up: the idea: 2 NFE on a 20B flow model. standard twinflow gives you a performance ceiling (student can't beat teacher). so i bolted on latent-space rl gradients to escape the teacher's distribution + dynamic renoise sampling to stop early training from collapsing. did it work? kinda! checkpoints from step 5k-10k are on hf if you want to poke at them and figure out what i did wrong. my guess is the rl weight ramp needs to be sigmoidal not a hard switch and more steps! All the links to the blog, code and the checkpoints are in comments (just so that X does not limit this post).

Shauray

@Shauray7

5 months ago

Distilling Qwen-Image-2512 using TwinFlow, student sucking up knowledge from a monster teacher I think qwen-image is capable of doing far more then Z-image atleast on the realistic front (personal observations). Slashed batch times with MP and custom augs, 8xH200 pinned at max pretty much. Also added RL to the loop in theory it should get better then the teacher but that remains to be seen since the RL kicks in after 2k steps. The loss wont be very indicative of how the training is working I guess since its a distillation run, attaching some results on how it's going [at 1100 and 1600].

Shauray7's tweet photo. Distilling Qwen-Image-2512 using TwinFlow, student sucking up knowledge from a monster teacher I think qwen-image is capable of doing far more then Z-image atleast on the realistic front (personal observations). Slashed batch times with MP and custom augs, 8xH200 pinned at max pretty much. Also added RL to the loop in theory it should get better then the teacher but that remains to be seen since the RL kicks in after 2k steps. The loss wont be very indicative of how the training is working I guess since its a distillation run, attaching some results on how it's going [at 1100 and 1600].

0

11

1

5

8K

1

64

7

50

7K

Shauray

@Shauray7

about 7 hours ago

clever, I think descending only matters because the scheduler is dynamic. on a static strided persistent scheduler, it should be symmetric against reversal, so asc should give similar speedups? though won't call it LPT anymore would be just grouping equal-K tiles to balance CTA bands.

0

27

Shauray7 retweeted

subho ghosh

@SubhoGhosh02

about 11 hours ago

FA4's SingleTileLPTScheduler exploits that causal attention work grows with block index, so it just visits blocks in reverse (block = num_block - 1 - block). So why not try something similar on grouped gemm! In grouped GEMM the analog is that a tile's mainloop time is proportional to its group's K, and StaticPersistentGroupTileScheduler visits tiles in group-metadata order. So LPT = order groups by descending K. Result is 1.74x speedup in grouped gemm, just by sorting the scheduling path.

SubhoGhosh02's tweet photo. FA4's SingleTileLPTScheduler exploits that causal attention work grows with block index, so it just visits blocks in reverse (block = num_block - 1 - block).

So why not try something similar on grouped gemm! In grouped GEMM the analog is that a tile's mainloop time is proportional to its group's K, and StaticPersistentGroupTileScheduler visits tiles in group-metadata order. So LPT = order groups by descending K.

Result is 1.74x speedup in grouped gemm, just by sorting the scheduling path.

3

49

9

24

2K

Shauray

@Shauray7

about 21 hours ago

If you're buying a UGREEN NVMe enclosure for anything chip-specific (atleast in india), there new batches atleast on there USB3.x enclosures come with an RTL chip (RTL9210) even though they mention ASM chips on some of there enclosures, might just be silicon lottery across there batches, this is for USB3.x, there USB4 enclosures might still come with an ASM chip

Shauray7's tweet photo. If you're buying a UGREEN NVMe enclosure for anything chip-specific (atleast in india), there new batches atleast on there USB3.x enclosures come with an RTL chip (RTL9210) even though they mention ASM chips on some of there enclosures, might just be silicon lottery across there batches, this is for USB3.x, there USB4 enclosures might still come with an ASM chip

0

2

0

91

Who to follow

Matt

@m13v_

I’ll come work for your startup for free for a day, no strings attached. my way to give back to the community for outcomes I’ve achieved. read more https://t.co/qPFnmmHvHA

Divya Lalwani

@dee013_

Software & Blockchain Developer👩‍💻 | Creating better experiences via web3🤝 | Interested in tech, minds, security and networking 🌐.

SpiderX 

@HamzaEzzahiry

Software Engineer at @TM_Officiel ⚡ @1337FIL alumni | Got CS bachelor @CadiAyyadUniv Mobile Developer 💻 Stacks: #Flutter,Nextjs & Solidity,Kotlin 🚀

Shauray7 retweeted

Yifei Wang

@WangYw251

1 day ago

Pixel-space autoregressive generation. The model demonstrates strong generative performance while maintaining high linear probing accuracy. Can we unify image understanding and generation with AR? https://t.co/l9QA5vFerA

0

32

5

31

3K

Shauray

@Shauray7

1 day ago

@__tinygrad__ @bfspector @Etched https://t.co/sIYQG7q909

GEOFF

@geoffreywoo

2 days ago

we love @Etched 🚀 congrats @robertwachen @UbertiGavin and the entire team. @LoganPaul and I @antifund saw the early demos last winter and what this team is building is insane. excited to run my own frontier inference cluster in my garage 😂

geoffreywoo's tweet photo. we love @Etched 🚀

congrats @robertwachen @UbertiGavin and the entire team. @LoganPaul and I @antifund saw the early demos last winter and what this team is building is insane.

excited to run my own frontier inference cluster in my garage 😂 https://t.co/wAAjMwfDj1

24

518

9

95

97K

0

34

Shauray

@Shauray7

2 days ago

https://t.co/EUc8G6rbmx

0

4

0

1

54

Shauray

@Shauray7

4 days ago

https://t.co/9H4WhNqhPO

0

4

0

3

237

Shauray

@Shauray7

4 days ago

step 11600, changed a lot of stuff, had a measurement bug on PSNR (it is the only metric i have however bullshit it might be), I thought it was undersampling so added a 50 step sample (It wasn't), added a decode recon test in order to measure the right PSNR, and more importantly made the run true to the paper, fixed LR, global BS 64, and much more balanced sigma selection, I feel its just underconverged the paper mentions 100k steps I'm not even close to being done but its good to remove all paths to failure. [compare the images to the quoted once, noticeably less noisy now] more details on Github [https://t.co/sw0uiNFDmG]

Shauray7's tweet photo. step 11600, changed a lot of stuff, had a measurement bug on PSNR (it is the only metric i have however bullshit it might be), I thought it was undersampling so added a 50 step sample (It wasn't), added a decode recon test in order to measure the right PSNR, and more importantly made the run true to the paper, fixed LR, global BS 64, and much more balanced sigma selection, I feel its just underconverged the paper mentions 100k steps I'm not even close to being done but its good to remove all paths to failure.

[compare the images to the quoted once, noticeably less noisy now]

more details on Github [https://t.co/sw0uiNFDmG]

Shauray

@Shauray7

9 days ago

5800, moving slowly ofc, no visible changes, though I finally got to writing the readme for it and everything is in its place now ! links to everything below ofc, using the readme as a report here on what worked and the progress of it all.

Shauray7's tweet photo. 5800, moving slowly ofc, no visible changes, though I finally got to writing the readme for it and everything is in its place now !

links to everything below ofc, using the readme as a report here on what worked and the progress of it all. https://t.co/V2CpCmCFzu

1

6

0

553

0

9

1

2

331

Shauray

@Shauray7

7 days ago

finally got time to play around with krea 2 and yeah it does blow my mind how much this does out of the box, always wanted to make something similar to "into the spiderverse" inspired something, I guess lets see how well it can replicate those. btw I'm using it with the layer scaling from here https://t.co/PEDKXDdi8A

Shauray7's tweet photo. finally got time to play around with krea 2 and yeah it does blow my mind how much this does out of the box, always wanted to make something similar to "into the spiderverse" inspired something, I guess lets see how well it can replicate those.

btw I'm using it with the layer scaling from here https://t.co/PEDKXDdi8A

0

4

0

187

Shauray

@Shauray7

9 days ago

@__tinygrad__ tilelang

0

134

Shauray

@Shauray7

9 days ago

links to the training script and the dataset - https://t.co/q24yCPbeof

Shauray

@Shauray7

24 days ago

So some progress on this, got a decent overfit run, Using 6/6 blocks rather then 3/3 i think in the paper cause that just works better on my runs, the thing does memorizes the images in pixel space, single step recon is pretty gooood. from pure noise is a little soft but hey it was a small run, does plateau after a while but could be just a scaling thing. training more on low-noise steps does make it plateau a little later but it does plateau after a while. I remember talking to @NagaSaiAbhinay about Muon a few days back so tried that as well, Muon on 2D attn/mlp weights and AdamW for embeddings, norm and decoder did not quite beat AdamW on my test runs - could be cause most of the work is being done by the detailer head here and Muon just doesn't finetune an Adam-pretrained model well. attached a couple of samples from the overfit run and loss for adam and muon runs, sudden jumps in the loss is due to low sigma.

Shauray7's tweet photo. So some progress on this, got a decent overfit run, Using 6/6 blocks rather then 3/3 i think in the paper cause that just works better on my runs, the thing does memorizes the images in pixel space, single step recon is pretty gooood.

from pure noise is a little soft but hey it was a small run, does plateau after a while but could be just a scaling thing. training more on low-noise steps does make it plateau a little later but it does plateau after a while.

I remember talking to @NagaSaiAbhinay about Muon a few days back so tried that as well, Muon on 2D attn/mlp weights and AdamW for embeddings, norm and decoder did not quite beat AdamW on my test runs - could be cause most of the work is being done by the detailer head here and Muon just doesn't finetune an Adam-pretrained model well.

attached a couple of samples from the overfit run and loss for adam and muon runs, sudden jumps in the loss is due to low sigma.

2

15

0

6

2K

0

3

0

1

288

Shauray

@Shauray7

9 days ago

training script and other stuff- https://t.co/q24yCPbeof cleaned dataset - https://t.co/484VPZgPqT

0

1

0

56

Shauray

@Shauray7

9 days ago

5800, moving slowly ofc, no visible changes, though I finally got to writing the readme for it and everything is in its place now ! links to everything below ofc, using the readme as a report here on what worked and the progress of it all.

Shauray

@Shauray7

11 days ago

4600, there were no GPUs where I wanted for the whole of yesterday lets see how today goes, still on 2 GPUs I'm seriously thinking of setting grad accum to 2 to match the paper, also on the second image the guy started walking the right direction now (look at the same image in the quoted post below)

Shauray7's tweet photo. 4600, there were no GPUs where I wanted for the whole of yesterday lets see how today goes, still on 2 GPUs I'm seriously thinking of setting grad accum to 2 to match the paper,

also on the second image the guy started walking the right direction now (look at the same image in the quoted post below)

0

6

0

361

1

6

0

553

Shauray7 retweeted

Bohan Hou @bohanhou1998

10 days ago

We release TIRx today, a minimal compiler stack and hardware-native DSL for frontier ML kernels, built around storage-first tensor layouts and reusable tile primitives. https://t.co/V1yHdxlpCi On NVIDIA B200, TIRx delivers up to ~1.08× over cuBLASLt on dense GEMM, outperforms DeepGEMM on all FP8 blockwise workloads with up to ~1.09× speedup, keeps FlashAttention-4 (FA4) typically within ~±2% of CuTeDSL, and remains competitive with cuBLASLt/FlashInfer on NVFP4 GEMM. Through our past experiences building frontier ML kernels, megakernels, and agentic kernel systems, we kept seeing the same boundary problem: new operators and new hardware require new optimization strategies that often break old programming models or compiler passes. TIRx builds on top of Apache TVM and moves toward a simple goal: let users and agents express the best-performing program, even for future hardware generations, while keeping the engineering effort for new kernels and new hardware as low as possible.

bohanhou1998's tweet photo. We release TIRx today, a minimal compiler stack and hardware-native DSL for frontier ML kernels, built around storage-first tensor layouts and reusable tile primitives.
https://t.co/V1yHdxlpCi

On NVIDIA B200, TIRx delivers up to ~1.08× over cuBLASLt on dense GEMM, outperforms DeepGEMM on all FP8 blockwise workloads with up to ~1.09× speedup, keeps FlashAttention-4 (FA4) typically within ~±2% of CuTeDSL, and remains competitive with cuBLASLt/FlashInfer on NVFP4 GEMM.

Through our past experiences building frontier ML kernels, megakernels, and agentic kernel systems, we kept seeing the same boundary problem: new operators and new hardware require new optimization strategies that often break old programming models or compiler passes.

TIRx builds on top of Apache TVM and moves toward a simple goal: let users and agents express the best-performing program, even for future hardware generations, while keeping the engineering effort for new kernels and new hardware as low as possible.

4

142

44

99

38K

Shauray7 retweeted

NYRE

@sleenyre

9 days ago

Spend a lot of time writing this over the past week. If you want juicy details on our training stack and infra. Go read it! https://t.co/3F8jSgyI9l

9

162

28

91

20K

Shauray7 retweeted

Naga/Abhi

@NagaSaiAbhinay

10 days ago

A collection of LoRA's for Krea 2 Turbo! https://t.co/iPZkJ5wuKE

1

87

8

71

8K

Shauray7 retweeted

Krea

@krea_ai

10 days ago

magnet:?xt=urn:btih:2a644d0279182a022d08dd395ea593cfcc218e12&dn=https://t.co/qJoz3m67n0

42

212

24

103

55K

Shauray

@Shauray7

11 days ago

4600, there were no GPUs where I wanted for the whole of yesterday lets see how today goes, still on 2 GPUs I'm seriously thinking of setting grad accum to 2 to match the paper, also on the second image the guy started walking the right direction now (look at the same image in the quoted post below)

Shauray

@Shauray7

12 days ago

Step 3200

1

3

1

0

384

0

6

0

361

Shauray

@Shauray7

12 days ago

Step 3200

Shauray

@Shauray7

12 days ago

started the final (maybe) run, using the same 6/6 block with a BS of 16 across 2xH200s for now its on spot and the ladder should check for 8/4 x H100/H200 from here, attaching some samples from the latest step (2200)

Shauray7's tweet photo. started the final (maybe) run, using the same 6/6 block with a BS of 16 across 2xH200s for now its on spot and the ladder should check for 8/4 x H100/H200 from here, attaching some samples from the latest step (2200) https://t.co/46SKfhL8ex

0

4

0

487

1

3

1

0

384

Shauray

@Shauray7

12 days ago

@NagaSaiAbhinay Haha i wish the run produces output comparable to this

0

2

0

33

Shauray

@Shauray7

12 days ago

been traveling around kerala for a while now, pretty much the reason there are no updates on the L2P training. Though I did try deploying it on runpod 8xH100 spot ofc and for some reason I could not, I remember it being pretty nice not really sure what happened I was bidding like more then the on demand instances to atleast deploy something but it just did not let me. So I migrated to verda, again spot instances, going good as of now though I can only find like 2xH200 everytime like 4x and 8x instances does pop here and there but the 2x I have has been working for like 8-9 hours now without degradations for some reason. I might remove the 2x from the ladder, though waiting for a 4x or a 8x to be available and with its relativly quick degradation period it might just be fine running the 2x for the whole time I'm not sure maybe tell me in the comments. Here are some not so related pictures from here.

Shauray7's tweet photo. been traveling around kerala for a while now, pretty much the reason there are no updates on the L2P training.
Though I did try deploying it on runpod 8xH100 spot ofc and for some reason I could not, I remember it being pretty nice not really sure what happened I was bidding like more then the on demand instances to atleast deploy something but it just did not let me.
So I migrated to verda, again spot instances, going good as of now though I can only find like 2xH200 everytime like 4x and 8x instances does pop here and there but the 2x I have has been working for like 8-9 hours now without degradations for some reason.
I might remove the 2x from the ladder, though waiting for a 4x or a 8x to be available and with its relativly quick degradation period it might just be fine running the 2x for the whole time I'm not sure maybe tell me in the comments.
Here are some not so related pictures from here.

3

4

0

143

Shauray

@Shauray7

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users