light @reprompting - Twitter Profile

Pinned Tweet

14 days ago

I’ll be posting threads of chapter summaries (sort of) here. I’m still not sure how accountable I can keep myself, but one can always try.

light @reprompting

16 days ago

Programming Massively Parallel Processors (PMPP) has 22 chapters in total. If I do one chapter a day, it should take me about three weeks? 🤔 I want to do this.


light @reprompting

about 20 hours ago

3. ELL → improves coalescing through padding, but can waste huge amounts of memory
4. Hybrid ELL-COO → reduces ELL padding overhead
5. JDS → sorts rows by length to improve load balance while maintaining coalesced accesses.
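To make the ELL trade-off concrete, here is a small CPU-side Python sketch. It is not the book's CUDA code: `csr_to_ell`, `spmv_ell`, and the 4x4 example matrix are illustrative assumptions. Every row is padded to the length of the longest row and stored column-major, which is what would let neighbouring GPU threads read neighbouring addresses; the padding count shows where the wasted memory comes from.

```python
def csr_to_ell(row_ptrs, col_idx, vals):
    """Pad every row to the longest row and store the result column-major."""
    num_rows = len(row_ptrs) - 1
    width = max(row_ptrs[r + 1] - row_ptrs[r] for r in range(num_rows))
    ell_cols = [0] * (num_rows * width)    # padded slots point at column 0
    ell_vals = [0.0] * (num_rows * width)  # padded slots hold an explicit zero
    for r in range(num_rows):
        for k, i in enumerate(range(row_ptrs[r], row_ptrs[r + 1])):
            # Column-major: the k-th nonzero of all rows sits contiguously.
            ell_cols[k * num_rows + r] = col_idx[i]
            ell_vals[k * num_rows + r] = vals[i]
    return ell_cols, ell_vals, width


def spmv_ell(ell_cols, ell_vals, width, num_rows, x):
    """y = A @ x; on a GPU each iteration of the outer loop would be one thread."""
    y = [0.0] * num_rows
    for r in range(num_rows):
        for k in range(width):
            slot = k * num_rows + r
            y[r] += ell_vals[slot] * x[ell_cols[slot]]
    return y


# Hypothetical example matrix in CSR form:
# [[1, 0, 7, 0],
#  [5, 0, 3, 9],
#  [0, 2, 8, 0],
#  [0, 0, 0, 6]]
row_ptrs = [0, 2, 5, 7, 8]
col_idx = [0, 2, 0, 2, 3, 1, 2, 3]
vals = [1.0, 7.0, 5.0, 3.0, 9.0, 2.0, 8.0, 6.0]
x = [1.0, 2.0, 3.0, 4.0]

ell_cols, ell_vals, width = csr_to_ell(row_ptrs, col_idx, vals)
padding = len(ell_vals) - len(vals)  # slots that store nothing useful
y = spmv_ell(ell_cols, ell_vals, width, len(row_ptrs) - 1, x)
```

Here one row with three nonzeros forces every row to three slots, so 8 real values occupy 12 slots. A single very long row in a large matrix would inflate the whole structure the same way, which is the overhead the hybrid ELL-COO format targets.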


light @reprompting

about 20 hours ago

Summarizing chapter 14 of PMPP. This chapter focuses on sparse matrix computation, specifically Sparse Matrix-Vector Multiplication (SpMV). The main challenge is that most elements in a sparse matrix are zeros, so storing and processing them wastes memory bandwidth.

https://t.co/GqUqad1lVD

light @reprompting

about 20 hours ago

The chapter then walks through various sparse matrix storage formats:
1. COO → simple, flexible, well-balanced but requires atomics
2. CSR → removes atomics and improves storage efficiency, but introduces load imbalance and poor coalescing.
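The difference between the two formats can be sketched on the CPU in Python. This is an illustration only, not the book's kernels: the function names and the example matrix are assumptions. The comments mark where a one-thread-per-nonzero COO kernel would need an atomic add, and why a one-thread-per-row CSR kernel does not.

```python
def spmv_coo(rows, cols, vals, x, num_rows):
    """COO: one (row, col, value) triple per nonzero, in any order."""
    y = [0.0] * num_rows
    for r, c, v in zip(rows, cols, vals):
        # With one GPU thread per nonzero, several threads may target the
        # same y[r], so this update would have to be an atomic add.
        y[r] += v * x[c]
    return y


def spmv_csr(row_ptrs, col_idx, vals, x):
    """CSR: row_ptrs[r]..row_ptrs[r+1] delimits the nonzeros of row r."""
    y = []
    for row in range(len(row_ptrs) - 1):
        # With one GPU thread per row, each thread owns y[row]: no atomics.
        # Rows differ in length, though, which is the load imbalance.
        acc = 0.0
        for i in range(row_ptrs[row], row_ptrs[row + 1]):
            acc += vals[i] * x[col_idx[i]]
        y.append(acc)
    return y


# Hypothetical example matrix:
# [[1, 0, 7, 0],
#  [5, 0, 3, 9],
#  [0, 2, 8, 0],
#  [0, 0, 0, 6]]
coo_rows = [0, 0, 1, 1, 1, 2, 2, 3]
col_idx = [0, 2, 0, 2, 3, 1, 2, 3]
vals = [1.0, 7.0, 5.0, 3.0, 9.0, 2.0, 8.0, 6.0]
row_ptrs = [0, 2, 5, 7, 8]  # replaces the 8 row indices with 5 offsets
x = [1.0, 2.0, 3.0, 4.0]

y_coo = spmv_coo(coo_rows, col_idx, vals, x, 4)
y_csr = spmv_csr(row_ptrs, col_idx, vals, x)
```

Both calls produce the same vector. CSR saves storage by replacing one row index per nonzero with one offset per row, at the price of threads walking rows of different lengths.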


light @reprompting

about 20 hours ago

She has 7.5k+ citations and expertise in NLP/LLMs, yet she still had to thoroughly revise everything and go through all the coding-test shenanigans. It really is a dog-eat-dog world.

Alisa Liu @alisawuffles

2 days ago

I'm joining OpenAI next week!🥹 The job search turned out to be really challenging but also super rewarding, so I wrote a small blog to share what I learned along the way and hopefully make the process a little less mysterious for the next person. https://t.co/6FigSBdenD


light @reprompting

2 days ago

@thisispiyushK I was referring more generally to radix sort's non-comparison based approach, rather than partitioning itself.


light @reprompting

3 days ago

Summarizing chapter 13 of PMPP. This chapter mostly focuses on radix sort and how it can be efficiently parallelized on GPUs. It is kind of counterintuitive: you don’t compare elements, but instead repeatedly partition elements into buckets based on individual bits.

https://t.co/brAttY6KBU

light @reprompting

2 days ago

Naive CUDA radix sort using bit extraction, exclusive scan, and scatter. It really is just prefix sums doing all the heavy lifting.

https://t.co/NXzrPeIRCN

light @reprompting

3 days ago

This chapter then revisits several optimization techniques introduced earlier (memory coalescing, shared memory, and thread coarsening) to improve the efficiency of these operations and reduce memory overhead.


light @reprompting

3 days ago

Radix sort is particularly well-suited for GPUs because each partitioning step can be expressed as parallel operations such as bit extraction, prefix sums (scans), and data scattering.
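Those three steps can be sketched in a few lines of Python. This is a CPU stand-in processing one bit per pass, assuming non-negative integer keys; it is not the book's CUDA kernel, and the helper names are made up for illustration. Each list comprehension or loop body corresponds to work that one GPU thread per key could do independently.

```python
def exclusive_scan(values):
    """out[i] = sum of values[0..i-1]; the step a GPU would run as a parallel scan."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def radix_sort_pass(keys, bit):
    """Stable partition of keys by one bit: zeros first, then ones."""
    # 1. Bit extraction: independent per key.
    bits = [(k >> bit) & 1 for k in keys]
    # 2. Exclusive scan: number of one-bits strictly before each position.
    ones_before = exclusive_scan(bits)
    total_zeros = len(keys) - sum(bits)
    # 3. Scatter: each key computes its own destination, no comparisons.
    out = [0] * len(keys)
    for i, k in enumerate(keys):
        if bits[i] == 0:
            dst = i - ones_before[i]            # zeros before me
        else:
            dst = total_zeros + ones_before[i]  # after all zeros, keep order
        out[dst] = k
    return out


def radix_sort(keys, num_bits=32):
    """Least-significant-bit-first radix sort for non-negative integers."""
    for bit in range(num_bits):
        keys = radix_sort_pass(keys, bit)
    return keys


data = [12, 3, 6, 9, 15, 8, 5, 10, 9, 6, 11, 13, 4, 10, 7, 0]
result = radix_sort(data, num_bits=4)  # 4 bits are enough for keys below 16
```

Because every pass is stable, the ordering established by the lower bits survives the passes over the higher bits, which is why the final list is fully sorted.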


light @reprompting

3 days ago

Using this, each thread can independently identify its input ranges and perform its portion of the merge. The chapter then discusses performance considerations. It goes on from basic parallel merge to coalescing to circular buffers.


light @reprompting

3 days ago

Summarizing chapter 12 of PMPP. This chapter focuses on parallel merge. Merging two sorted arrays is straightforward sequentially. The challenge in parallelizing is that threads cannot immediately start merging.


light @reprompting

3 days ago

They first need to determine which portions of the two input arrays they are responsible for. For this, the chapter introduces the concept of co-ranking. Given a position k in the output array, co-ranking determines how many elements should come from array A and how many from array B.
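A minimal Python sketch of co-ranking, assuming both inputs are sorted lists and that ties are resolved by taking from A first. It uses a plain binary search rather than the exact loop from the book, and `co_rank`, `merge_sequential`, and `parallel_merge` are illustrative names. The last function simulates several "threads", each of which uses two co-rank calls to find its own slice of A and B.

```python
def co_rank(k, a, b):
    """Return (i, j) with i + j == k such that merging a[:i] and b[:j]
    yields the first k elements of the merged output."""
    m, n = len(a), len(b)
    lo, hi = max(0, k - n), min(k, m)
    while lo < hi:
        i = (lo + hi) // 2
        j = k - i
        # Inside the loop i < m and 1 <= j <= n always hold.
        if b[j - 1] >= a[i]:
            lo = i + 1  # a[i] belongs among the first k: take more from A
        else:
            hi = i
    return lo, k - lo


def merge_sequential(a, b):
    """Ordinary stable two-pointer merge."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]


def parallel_merge(a, b, num_threads):
    """Each simulated thread merges one contiguous chunk of the output."""
    total = len(a) + len(b)
    chunk = -(-total // num_threads)  # ceiling division
    out = []
    for t in range(num_threads):      # on a GPU these iterations are independent
        k0 = min(t * chunk, total)
        k1 = min(k0 + chunk, total)
        i0, j0 = co_rank(k0, a, b)
        i1, j1 = co_rank(k1, a, b)
        out.extend(merge_sequential(a[i0:i1], b[j0:j1]))
    return out


A = [1, 7, 8, 9, 10]
B = [7, 10, 10, 12]
merged = parallel_merge(A, B, num_threads=3)
```

For example, `co_rank(3, A, B)` gives `(2, 1)`: the first three outputs are two elements of A and one of B.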


light @reprompting

5 days ago

1. Kogge-Stone (exposes a large amount of parallelism but performs extra work)
2. Brent-Kung (performs much less work overall but exposes less parallelism)
This chapter then discusses thread coarsening and hierarchical scans for large inputs.
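The trade-off shows up if both scans are simulated on the CPU while counting additions and parallel steps. The sketch below is a Python illustration under assumptions, not the book's kernels: the function names are made up, each inner loop stands for what the threads would do in one synchronized step, and the Brent-Kung version assumes the input length is a power of two.

```python
def kogge_stone_scan(x):
    """Inclusive scan; returns (result, additions, parallel_steps)."""
    data, n = list(x), len(x)
    ops = steps = 0
    stride = 1
    while stride < n:
        prev = list(data)  # double buffer: read old values, write new ones
        for i in range(stride, n):
            data[i] = prev[i] + prev[i - stride]
            ops += 1
        steps += 1
        stride *= 2
    return data, ops, steps


def brent_kung_scan(x):
    """Inclusive scan for power-of-two lengths; same return shape."""
    data, n = list(x), len(x)
    ops = steps = 0
    # Reduction tree: partial sums accumulate toward the last element.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            data[i] += data[i - stride]
            ops += 1
        steps += 1
        stride *= 2
    # Reverse tree: push the partial sums back to the remaining positions.
    stride = n // 4
    while stride >= 1:
        for i in range(3 * stride - 1, n, 2 * stride):
            data[i] += data[i - stride]
            ops += 1
        steps += 1
        stride //= 2
    return data, ops, steps


values = [3, 1, 7, 0, 4, 1, 6, 3]
ks_result, ks_ops, ks_steps = kogge_stone_scan(values)
bk_result, bk_ops, bk_steps = brent_kung_scan(values)
```

For these 8 elements, Kogge-Stone finishes in 3 steps but performs 17 additions, while Brent-Kung needs 5 steps and only 11 additions. Both produce the same prefix sums.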


light @reprompting

5 days ago

Summarizing chapter 11 of PMPP. This chapter focuses on prefix sum (scan), one of the most important parallel patterns. Many algorithms that appear inherently sequential can be reformulated as scans, making them suitable for parallel execution.


light @reprompting

5 days ago

The interesting part is that there isn't a single "best" scan algorithm. Different algorithms make different tradeoffs between parallelism and work efficiency. This chapter compares two classic approaches:


light @reprompting

6 days ago

Naive CUDA softmax using shared memory reduction. Reduction seems to be a pretty straightforward concept.
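For reference, the structure can be sketched in Python. This is a CPU simulation under assumptions, not the CUDA kernel itself: `tree_reduce` and `softmax` are illustrative names, and the list `buf` stands in for the shared-memory buffer. Softmax needs two reductions over the row, a max (for numerical stability) and a sum.

```python
import math


def tree_reduce(values, op):
    """Reduce with a halving stride, the way a block would in shared memory."""
    buf = list(values)  # stand-in for the shared-memory buffer
    stride = 1
    while stride < len(buf):
        stride *= 2
    stride //= 2        # largest power of two below the (rounded-up) length
    while stride >= 1:
        # One synchronized step: "thread" i combines slots i and i + stride.
        for i in range(stride):
            if i + stride < len(buf):
                buf[i] = op(buf[i], buf[i + stride])
        stride //= 2
    return buf[0]


def softmax(row):
    """Numerically stable softmax built from a max and a sum reduction."""
    row_max = tree_reduce(row, max)
    exps = [math.exp(v - row_max) for v in row]  # independent per element
    total = tree_reduce(exps, lambda a, b: a + b)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0, 4.0, 5.0])
```

Keeping the active threads at indices `0..stride-1` is one common way to arrange the tree so that active threads stay contiguous. The row length does not have to be a power of two here because out-of-range partners are simply skipped.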


light @reprompting

6 days ago

Chapter 10/22 (day 10) https://t.co/o2GFlkfCSP

light @reprompting

6 days ago

Summarizing chapter 10 of PMPP. This chapter focuses on reduction, a pattern that derives a single value (sum, max, min, etc.) from an array of values. While reduction is conceptually simple, its parallel implementation can be surprisingly inefficient.


light @reprompting

6 days ago

Chapter 9/22 (day 9) https://t.co/rmfPrin6wf

light @reprompting

7 days ago

Summarizing chapter 9 of PMPP. This chapter introduces histogram computation which is different from most of the patterns that have been discussed so far. In matmul, conv or stencil computations, each output element is usually owned by a single thread.
