Ramsha Khan @__ramshaaa__ - Twitter Profile

Ramsha Khan @__ramshaaa__

18 days ago

@Caffeinix_alche Eid mubarak 🫂✨

0

1

0

29

Ramsha Khan @__ramshaaa__

2 months ago

@thestoicccoder if that's gulab jamun & those on the left are pineapple and watermelon then why aren't they on separate plates! (just my OCD brain thinking xD)

0

2

0

58

__ramshaaa__ retweeted

Khushi🪐

@_khusheyyy

2 months ago

i feel like all roads lead to mathematics physics philosophy

308

9K

1K

932

176K

Ramsha Khan @__ramshaaa__

2 months ago

@jino_rohit 🔗https://t.co/3Bpm9nS4JX I've put a link at the beginning to a blog on online softmax that I found quite helpful while learning. Btw, any feedback is appreciated 🙌

1

2

0

29

Ramsha Khan @__ramshaaa__

2 months ago

@jino_rohit This is cool. What resources are you referring to? I tried to write a blog putting together my understanding of flashattn, and I didn't go super deep into online softmax there (but getting to the code took a lot of notebook scribbling & dry runs to really convince my brain)

1

0

47

Ramsha Khan @__ramshaaa__

2 months ago

@ashanviii Cooool !

0

32

Ramsha Khan @__ramshaaa__

2 months ago

@mmaaz_98 @thinkymachines @miramurati Woww congratsss maaz!!

0

1

0

176

Ramsha Khan @__ramshaaa__

2 months ago

@weiiiisuiii Adding one more point: many options out there to create your resume, I prefer: https://t.co/GHKBS17xPC (you can check it out)

[Tweet photo: https://t.co/rQsivNq8YV]

1

2

0

4

756

Ramsha Khan @__ramshaaa__

2 months ago

Training is faster with NVLink than without it. If you're curious about NVLink, give this a read: https://t.co/WnqTMoPkXO Quick note: I might take a break from this series for a few days, will pick it up again soon!

0

3

0

94

Ramsha Khan @__ramshaaa__

2 months ago

Day 14 of learning distributed training: Exploring GPU topology 👇 X means the GPU is referring to itself (so yeah, no communication with itself xD). When you see PHB, it means the GPUs are connected via the PCIe Host Bridge (CPU), so there is no direct GPU <-> GPU link. (1/n)
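The codes in a topology matrix can be read with a small lookup. This is a minimal sketch, assuming the matrix comes from `nvidia-smi topo -m`; the descriptions are paraphrased from that tool's legend, and the helper names and the two-GPU matrix are hypothetical, made up for illustration.

```python
# Connection codes used in the `nvidia-smi topo -m` legend (descriptions paraphrased).
LEGEND = {
    "X": "self",
    "PIX": "at most a single PCIe bridge",
    "PXB": "multiple PCIe bridges, without crossing the PCIe host bridge",
    "PHB": "PCIe host bridge (typically the CPU)",
    "NODE": "PCIe plus the interconnect between host bridges within a NUMA node",
    "SYS": "PCIe plus the interconnect between NUMA nodes",
}


def is_nvlink(code: str) -> bool:
    """True for codes like NV1, NV2, NV4: a bonded set of NVLinks."""
    return code.startswith("NV") and code[2:].isdigit()


def describe_link(code: str) -> str:
    """Turn one matrix cell into a human-readable description."""
    if is_nvlink(code):
        return f"bonded set of {code[2:]} NVLink(s)"
    return LEGEND.get(code, "unknown code")


# Hypothetical matrix for two GPUs that only reach each other through the host bridge.
topo = [
    ["X", "PHB"],
    ["PHB", "X"],
]

for i, row in enumerate(topo):
    for j, code in enumerate(row):
        print(f"GPU{i} -> GPU{j}: {describe_link(code)}")
```

In this sketch only the NV# codes count as a direct GPU <-> GPU link; every other code routes through some part of the PCIe hierarchy.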

[Tweet photo: https://t.co/WDa9B6isqe]

Ramsha Khan @__ramshaaa__

2 months ago

Day 13 of learning distributed training: We've covered collective operations where multiple processes take part in communication. Now there’s this -> point-to-point communication (one-to-one) where you pass data from one specific process to another (not all processes). (1/n)

[Tweet photo: https://t.co/PUciZs6gbe]

1

5

0

572

2

7

0

335

Ramsha Khan @__ramshaaa__

2 months ago

NVLink = direct GPU interconnect -> much higher bandwidth than PCIe. So communication paths actually matter a lot for distributed training. I was reading a Hugging Face article where they compared DDP performance with & without NVLink, and the difference is pretty clear: (2/n)

2

3

0

122

Ramsha Khan @__ramshaaa__

2 months ago

https://t.co/jGt9Xd94gI

Ramsha Khan @__ramshaaa__

3 months ago

Hi people! It's been a while since I started exploring distributed training, so I thought I'd start sharing what I'm learning (and will try to stay consistent). Right now I'm starting with parallelism strategies for model training. (1/3)

1

3

0

1

903

0

4

0

245

Ramsha Khan @__ramshaaa__

2 months ago

Both processes should NOT send at the same time: each one blocks in its send while waiting for the other to receive, so they wait forever, which is a deadlock. Also, don’t modify the tensor before .wait() when using non-blocking functions.

0

2

0

86

Ramsha Khan @__ramshaaa__

2 months ago

Send the tensor between processes using send() and recv(). There are also isend() and irecv(), which are non-blocking: the transfer can happen in the background while some other computation/work runs simultaneously. (2/n)
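The matched send/receive ordering can be sketched in plain Python. This is only a stand-in under stated assumptions: threads play the role of processes, queues play the role of the transport, and the `send`/`recv` helpers, `mailboxes`, and `worker` below are hypothetical names, not the API of any distributed library.

```python
import queue
import threading

WORLD_SIZE = 2

# One mailbox per ordered (src, dst) pair; a hypothetical stand-in for a real transport.
mailboxes = {
    (src, dst): queue.Queue()
    for src in range(WORLD_SIZE)
    for dst in range(WORLD_SIZE)
    if src != dst
}


def send(data, src, dst):
    """Blocking send: returns only after the peer has received the data."""
    box = mailboxes[(src, dst)]
    box.put(list(data))  # copy, so later edits by the sender are not visible
    box.join()           # wait until the receiver calls task_done()


def recv(src, dst, timeout=5.0):
    """Blocking receive of the next message sent from src to dst."""
    box = mailboxes[(src, dst)]
    data = box.get(timeout=timeout)
    box.task_done()      # releases the sender blocked in send()
    return data


results = {}


def worker(rank):
    if rank == 0:
        # Rank 0 sends first, then receives the reply.
        send([1.0, 2.0, 3.0], src=0, dst=1)
        results[0] = recv(src=1, dst=0)
    else:
        # Rank 1 does the mirror image: receive first, then send.
        received = recv(src=0, dst=1)
        send([2 * x for x in received], src=1, dst=0)
        results[1] = received


threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

If both ranks called `send()` first in this sketch, each would block inside `box.join()` waiting for a receive that never comes, which is the deadlock pattern to avoid.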

1

2

0

87

Ramsha Khan @__ramshaaa__

2 months ago

Things NOT to do! Be careful when setting src, dst, and the operation: if process0 sends to process1, then process1 should receive from process0. (3/n)

1

2

0

81

Ramsha Khan @__ramshaaa__

2 months ago

Day 12 of learning distributed training: We saw a linear relationship between workers and the steps needed for communication, so there was an assumption that latency doesn’t matter much. But in large distributed systems, that assumption breaks as latency is not negligible. (1/n)
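To see why latency stops being negligible, it helps to count steps. This is a rough sketch, assuming the linear relationship refers to a ring-style all-reduce (2(N-1) steps) and comparing it with a tree-style reduce plus broadcast (about 2*ceil(log2 N) steps). The per-step latency of 5 microseconds is a made-up illustrative number, not a measurement.

```python
import math


def ring_allreduce_steps(num_workers: int) -> int:
    """Reduce-scatter plus all-gather around a ring: 2 * (N - 1) steps."""
    return 2 * (num_workers - 1)


def tree_allreduce_steps(num_workers: int) -> int:
    """Reduce up a binary tree, then broadcast down: about 2 * ceil(log2 N) steps."""
    return 2 * math.ceil(math.log2(num_workers))


def latency_cost_us(steps: int, per_step_latency_us: float) -> float:
    """Latency-only cost: every step pays a fixed startup latency."""
    return steps * per_step_latency_us


PER_STEP_LATENCY_US = 5.0  # hypothetical value, for illustration only

for n in (8, 64, 1024):
    ring = ring_allreduce_steps(n)
    tree = tree_allreduce_steps(n)
    print(
        f"N={n}: ring {ring} steps ({latency_cost_us(ring, PER_STEP_LATENCY_US)} us), "
        f"tree {tree} steps ({latency_cost_us(tree, PER_STEP_LATENCY_US)} us)"
    )
```

The step count grows linearly with the number of workers for the ring and only logarithmically for the tree, so the latency term dominates the ring's cost at large scale.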

[Tweet photo: https://t.co/wwPu9LlbIi]

1

8

0

527

1

5

0

572

Ramsha Khan @__ramshaaa__

2 months ago

@sodakeyEatsMush Glad you liked it, thank youu!

0

1

0

21

Ramsha Khan @__ramshaaa__

2 months ago

A leaf that was idle in Tree A now becomes active in Tree B. So even if some GPUs are idle in one tree, they are active in the other.

0

2

0

45

Ramsha Khan @__ramshaaa__

2 months ago

Day 11 of learning distributed training: Let's keep going with collective ops by zooming into All-Reduce and what's happening behind the scenes. So I found a couple of ways to do naive all-reduce: (1/n)

1

3

0

624

1

8

0

527

Ramsha Khan @__ramshaaa__

2 months ago

Then, to fix the utilization issue, Double Binary Trees were introduced. It’s a sweet spot between the previous two approaches. There are two trees: a root in Tree A acts as a leaf in Tree B. (3/n)

1

3

0

57
