Isha

Verified account

@slowdownisha

ML | 2nd year ug | i make things make sense (eventually)

Joined February 2022

524 Following

912 Followers

7.7K Posts

Pinned Tweet

3 months ago

Guys, the Blog is here!

Why does Gradient descent struggle, and how does momentum completely change the game?

I wrote a deep dive covering:
+ A quick intro to Gradient Descent
+ Where it breaks down
+ How momentum fixes those issues
+ The math behind it (intuitive + formal)
+ Code implementation

Tried to make this enjoyable for both beginners and those who love the math side of things.

Hope you all will like it. Feedback is always welcome!
QT/RT appreciated <3
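
For readers who want a feel for the topic before opening the blog, here is a minimal sketch of plain gradient descent next to gradient descent with momentum. It is not the code from the blog, which is not shown in this feed. The test function `f`, the learning rate, `beta`, and the step count are all assumptions picked for illustration, and the update `v = beta * v + grad` is only one common way to write momentum.

```python
def f(p):
    """Ill-conditioned quadratic bowl: steep along y, shallow along x."""
    x, y = p
    return 0.5 * (x * x + 25.0 * y * y)


def grad(p):
    """Analytic gradient of f."""
    x, y = p
    return (x, 25.0 * y)


def gradient_descent(p, lr, steps):
    """Plain gradient descent: step straight down the current gradient."""
    for _ in range(steps):
        gx, gy = grad(p)
        p = (p[0] - lr * gx, p[1] - lr * gy)
    return p


def momentum_descent(p, lr, beta, steps):
    """Momentum: step along a decaying running sum of past gradients."""
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(p)
        vx = beta * vx + gx
        vy = beta * vy + gy
        p = (p[0] - lr * vx, p[1] - lr * vy)
    return p


# Illustrative settings (assumptions, not values from the blog).
start = (10.0, 1.0)
plain = gradient_descent(start, lr=0.03, steps=100)
heavy = momentum_descent(start, lr=0.03, beta=0.9, steps=100)
print("plain GD loss:", f(plain))
print("momentum loss:", f(heavy))
```

On this bowl the steep y direction caps the usable learning rate (plain gradient descent diverges once lr * 25 exceeds 2), so progress along the shallow x direction is slow. The velocity term keeps accumulating along x, which is why the momentum run should end at a lower loss for the same learning rate and step count.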

about 24 hours ago

@Harry_The_Nerd Thankk youu!!🥹

4 days ago

I’ve been hacking around with nanoGPT during endsems and ran a few small pretraining experiments.

I moved from OpenWeb-style data to a mixed 100M-token dataset:
- 70% FineWeb-Edu
- 20% Cosmopedia
- 10% Python code

I picked this mix to make the data feel closer to modern pretraining: web/education text, explanation-style synthetic data, and some code.

For the baseline, I trained a small decoder-only model in the nanoGPT style with learned absolute positional embeddings.
Then I ran one clean architecture ablation: replacing learned positional embeddings with RoPE, while keeping the dataset, model size, optimizer, and token budget the same.

Results:
Baseline best val loss: 4.29455
RoPE best val loss: 4.04767

RoPE improved validation loss by 0.24688 under the same training budget.

Throughput tradeoff:
Baseline: ~480k tokens/sec
RoPE: ~407k tokens/sec

Main learning: RoPE gave better model quality, but was about 15% slower because it adds extra computation inside attention.

Also, a big thank you to @HotAisle for the compute. :)
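
To make the "extra computation inside attention" concrete, here is a minimal pure-Python sketch of a rotary position embedding applied to a single query/key vector. It is not the implementation used in the experiment above, which is not shown. The function name `rope`, the interleaved pairing of dimensions, and the base of 10000 are assumptions for illustration; real implementations usually vectorize this and some pair dimensions differently.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by an angle proportional to `pos`.

    Pair j = i // 2 uses frequency base ** (-2j / d), so early pairs rotate
    quickly with position and later pairs rotate slowly.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))


# Made-up query and key vectors for illustration.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(score_near, score_far)
```

The two printed scores should agree up to floating-point error: after rotation, the query-key dot product depends only on the relative offset between positions, not on where the pair sits in the sequence. The cost is that every query and key is rotated in every attention layer, whereas a learned absolute embedding is added once at the input.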

about 24 hours ago

@iamsmruti09 Yooo. Wlcm backk!

6 days ago

i forgot my laptop's password

2 days ago

@Gauri_the_great Thank youu gauriii♥️🫶

2 days ago

today's read! https://t.co/70HjsqsBTw

slowdownisha retweeted

3 days ago

@tokenbender: We are releasing a fully reproducible early preprint of "Prism: Unlocking Language Model Capability Extraction".

A trained language model knows many things at once, but deployment usually asks for one behavior at a time. In enterprise scenarios, a few products, workflows, features, or use-cases often matter disproportionately.

Prism asks and answers a simple question - "Is it possible to isolate and deploy only capabilities that are driven by the Pareto principle and cut down costs by a huge margin while preserving most of the performance?"

This paper discusses a novel approach to efficiency and to understanding model behavior, and opens up capability extraction.

3 days ago

@tokenbender this is so cool!🔥

3 days ago

https://t.co/3KXlRzBBAh

3 days ago

@himanshustwts thnqyou:)

4 days ago

@HotAisle Thank you for the support 🫪

4 days ago

@curlysaarthak woww

11 days ago

@deephivex yayy🔥

11 days ago

@Harry_The_Nerd cwazyyy🔥

11 days ago

Starting with nanoGPT to understand the architecture end to end:
data prep, tokenization, batching, transformer blocks, training loop, evals, checkpoints, and sampling.

Once the pipeline becomes clear, I will start experimenting with different pretraining recipes ;) https://t.co/X1nx4O4D3o

13 days ago

- tried the trick from @0xMukesh's reply: to keep stats alive across merges, patch only the affected neighbors.

- also started reading about RoPE. https://t.co/h7UiBKsbLw
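
The first note above does not name the algorithm, so the sketch below rests on an assumption: that "stats" means the adjacent-pair counts used when training a BPE tokenizer, and that the trick is to update those counts incrementally during a merge instead of recounting the whole sequence. The helper names `count_pairs` and `merge_and_patch` are hypothetical, and this is not the code from the referenced reply.

```python
from collections import Counter


def count_pairs(ids):
    """Count every adjacent pair in the sequence (the full recount)."""
    return Counter(zip(ids, ids[1:]))


def merge_and_patch(ids, pair, new_id, stats):
    """Replace each occurrence of `pair` with `new_id`, patching `stats` in place.

    Only pairs that touch a merge site change: the merged pair itself and the
    pairs it formed with its left and right neighbors.
    """
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            left = out[-1] if out else None            # neighbor in the new sequence
            right = ids[i + 2] if i + 2 < len(ids) else None
            stats[pair] -= 1
            if left is not None:
                stats[(left, pair[0])] -= 1
                stats[(left, new_id)] += 1
            if right is not None:
                stats[(pair[1], right)] -= 1
                stats[(new_id, right)] += 1
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


# Toy input for illustration.
ids = list(b"aaabdaaabac")
stats = count_pairs(ids)
top_pair = stats.most_common(1)[0][0]
new_ids = merge_and_patch(ids, top_pair, 256, stats)

# Unary plus drops zero counts so the patched stats compare cleanly to a recount.
print(+stats == count_pairs(new_ids))
```

The final line compares the patched counts against a full recount of the merged sequence, which is a cheap way to check the bookkeeping while experimenting.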

11 days ago

@igorfomich I have just started, so I'm still working through it myself. For longer context I would increase block size, then reduce batch size or adjust grad accumulation.
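
The trade-off in this reply can be sketched as simple arithmetic: tokens per optimizer step is block size times batch size times gradient accumulation steps, so a longer block can be offset by a smaller batch or by more accumulation. The numbers and the helper names `tokens_per_step` and `rebalance` below are assumptions for illustration, not the settings from any run mentioned here.

```python
def tokens_per_step(block_size, batch_size, grad_accum):
    """Tokens consumed per optimizer step on a single device."""
    return block_size * batch_size * grad_accum


def rebalance(cfg, new_block_size, new_batch_size):
    """Pick grad_accum so tokens per optimizer step match the original config."""
    target = tokens_per_step(**cfg)
    per_micro_batch = new_block_size * new_batch_size
    if target % per_micro_batch != 0:
        raise ValueError("new block/batch sizes do not divide the token budget")
    return {
        "block_size": new_block_size,
        "batch_size": new_batch_size,
        "grad_accum": target // per_micro_batch,
    }


# Hypothetical starting point.
base = {"block_size": 256, "batch_size": 64, "grad_accum": 4}

# 4x longer context, 4x smaller batch: accumulation stays the same.
longer = rebalance(base, new_block_size=1024, new_batch_size=16)

# If memory is still tight, shrink the batch again and accumulate more.
tighter = rebalance(base, new_block_size=1024, new_batch_size=8)

for cfg in (base, longer, tighter):
    print(cfg, tokens_per_step(**cfg))
```

Keeping tokens per step fixed makes runs with different context lengths easier to compare, since each optimizer update still sees the same amount of data.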

11 days ago

@Harry_The_Nerd thank youu so much !!🥰

12 days ago

@Harry_The_Nerd ayyyy🥳
