prath_it_is @prathyusha2002 - Twitter Profile

Pinned Tweet

prath_it_is @prathyusha2002

about 5 years ago

I got my first internship and freelance gig at the same time. Now I don't know if I should celebrate and chill or go start working ASAP 😶

5

15

0

prath_it_is @prathyusha2002

7 months ago

vscode website is down

0

5

0

262

prathyusha2002 retweeted

alex zhang

@a1zhang

9 months ago

All the recordings for the @GPU_MODE x @scaleml series are up as a playlist in case you missed it 😁 There's so much value in these ~8 hours of lectures, from proving quantization error bounds on a whiteboard to a deep-dive into GPU warp schedulers! Plz take advantage of it! https://t.co/t30sCFmBK9

7

639

101

710

61K

prath_it_is @prathyusha2002

9 months ago

@catalinmpit @warpdotdev I use it! Except for higher RAM usage, everything else is amazing, especially autocorrect and autofill are >>>

0

86

prath_it_is @prathyusha2002

9 months ago

@GithubProjects standup

0

1

0

10

prath_it_is @prathyusha2002

9 months ago

Ok, did anyone do RLVR and it actually worked?

0

61

prath_it_is @prathyusha2002

9 months ago

@dakshgup I just applied, I am very interested in the Generalist SWE role!

0

1

0

93

prath_it_is @prathyusha2002

10 months ago

@rankdim When Llama or other models learned these capabilities through SFT and were then trained with RLVR, their performance increased too.

0

1

0

10

prath_it_is @prathyusha2002

10 months ago

@rankdim Until Qwen models emerged, no other models worked well with RLVR. Qwen inherently had capabilities like verification and backtracking, which let RLVR improve the likelihoods of these capabilities.

1

0

109

prath_it_is @prathyusha2002

10 months ago

@TheCinesthetic your name

0

22

prath_it_is @prathyusha2002

10 months ago

@aishwarya_2x21 Lagrangian is basically gradient descent, good luck though

0

14

prath_it_is @prathyusha2002

10 months ago

I once put 1.5 ETH in a MetaMask wallet and was told to just forget about it. So I did. Now it’s worth $6.5k… and I don’t have the 12-word phrase 😭

4

3

0

216

prath_it_is @prathyusha2002

10 months ago

@aaditsh This is the system prompt we need

0

90

prath_it_is @prathyusha2002

10 months ago

@trashh_dev Mine thinks it is “America India”

0

7

prath_it_is @prathyusha2002

10 months ago

@alec_helbling I did and the loss went up during training

0

1

23

prath_it_is @prathyusha2002

10 months ago

Can I please get more eyes on this to validate this fact

0

74

prath_it_is @prathyusha2002

10 months ago

Just today years old when I learned that gradient descent is basically the same as Lagrangian mechanics, shoutout to my high-energy physics PhD friend for blowing my mind. 🤯

1

0

95

prath_it_is @prathyusha2002

10 months ago

@alec_helbling Meanwhile in RL where they use it as some sort of regularizer

0

50

prathyusha2002 retweeted

Daniel Han

@danielhanchen

10 months ago

OpenAI's OSS model possible breakdown:
1. 120B MoE 5B active + 20B text only
2. Trained with Float4 maybe Blackwell chips
3. SwiGLU clip (-7,7) like ReLU6
4. 128K context via YaRN from 4K
5. Sliding window 128 + attention sinks
6. Llama/Mixtral arch + biases

Details:

1. 120B MoE 5B active + 20B text only
Most likely 2 models will be released as per https://t.co/b0lszaF1eV - 120B MoE with 5B/6B active and a 20B dense probably (or MoE). Not multimodal most likely, just text for now.

2. Trained with Float4 maybe Blackwell chips
MoE layers MLP are merged up / down probably with 8bit scaling factors and float4 weights. Most likely trained with Blackwell chips since they support float4. Or maybe PTQ to float4.

3. SwiGLU clip (-7,7) like ReLU6
Clips SwiGLU to -7 and 7 to reduce outliers and aid float4 quantization. Normally -6 to 6 is good for float4's range, but -7 and 7 is ok as well.

4. 128K context via YaRN from 4K
Native 128K context extended via YaRN from 4K. Long context extension was done probably during mid-training.

5. Sliding window 128 + attention sinks
SWA of 128 was used, but to counteract the SWA not remembering past info, attention sinks like in https://t.co/JMAERMJZGf was used. Maybe 4 / 8 vectors are used. TensorRT-LLM supports the flag "sink_token_length" for attention sinks https://t.co/n2Nqf7iOBt

6. Llama/Mixtral arch + biases
Merged QKV, MLP and also biases are used on all modules it seems. MoE Router has bias as well.

We discussed in @AiEleuther discord here: https://t.co/bxXVtkSUkB
Credits to @apples_jimmy , @secemp9 and others in the Discord server for the discussions!
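
To make point 5 of the retweeted thread concrete, here is a minimal sketch of how a sliding window can be combined with attention sinks in a causal attention mask. It assumes the sinks are simply the first few token positions, which every later query may still attend to; the thread does not confirm that this is how the model implements them. The function name `sink_sliding_window_mask` and the parameters `window` and `num_sinks` are hypothetical, chosen only for illustration.

```python
def sink_sliding_window_mask(seq_len, window=128, num_sinks=4):
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumption: sinks are the first `num_sinks` positions, kept visible to all
    later queries. This is an illustrative sketch, not the model's actual code.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i              # never attend to future positions
            in_window = i - j < window   # the most recent `window` positions, including i
            is_sink = j < num_sinks      # the first few positions stay visible
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Tiny example: 6 tokens, window of 2, one sink token.
    for row in sink_sliding_window_mask(6, window=2, num_sinks=1):
        print("".join("x" if ok else "." for ok in row))
```

With a window of 128 and 4 sinks, a query far into the sequence would see only its 128 most recent positions plus the first 4, which is the "counteract the SWA not remembering past info" idea the thread describes.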