Programming Massively Parallel Processors (PMPP) has 22 chapters in total. If I do one chapter a day, it should take me about three weeks? 🤔
I want to do this.
Summarizing chapter 14 of PMPP.
This chapter focuses on sparse matrix computation, specifically Sparse Matrix-Vector Multiplication (SpMV). The main challenge is that most elements in a sparse matrix are zeros, so storing and processing them wastes memory bandwidth.
The chapter then walks through various sparse matrix storage formats:
1. COO -> simple, flexible, well-balanced but requires atomics
2. CSR -> removes atomics and improves storage efficiency, but introduces load imbalance and poor coalescing.
She has 7.5k+ citations and expertise in NLP/LLMs, yet she still had to thoroughly revise everything and go through all the coding-test shenanigans. It really is a dog-eat-dog world.
I'm joining OpenAI next week!🥹 The job search turned out to be really challenging but also super rewarding, so I wrote a small blog to share what I learned along the way and hopefully make the process a little less mysterious for the next person. https://t.co/6FigSBdenD
Summarizing chapter 13 of PMPP.
This chapter mostly focuses on radix sort and how it can be efficiently parallelized on GPUs. It is kind of counterintuitive: you don’t compare elements, but instead repeatedly partition elements into buckets based on individual bits.
This chapter then revisits several optimization techniques introduced earlier, memory coalescing, shared memory, and thread coarsening, to improve the efficiency of these operations and reduce memory overhead.
Radix sort is particularly well-suited for gpus because each partitioning step can be expressed as parallel operations such as bit extraction, prefix sums (scans), and data scattering.
Using this, each thread can independently identify its input ranges and perform its portion of the merge. The chapter then discusses performance considerations. It goes on from basic parallel merge to coalescing to circular buffers.
Summarizing chapter 12 of PMPP.
This chapter focuses on parallel merge. Merging two sorted arrays is straghtfroward sequentially. The challenge in parallelizing is that threads cannot immediately start merging.
They first need to determine which portions of the two input arrays they are responsible for. For this, the chapter introduces the concept of co-ranking.
Given a position k in the output array, co-ranking determines how many elements should come from array A and how many from B
1. Kogge-Stone (exposes a large amount of parallelism but performs extra work)
2. Brent-Kung (performs much less work overall but exposes less parallelism)
This chapter then discusses thread coarsening and hierarchical scans for large inputs.
Summarizing chapter 11 of PMPP.
This chapter focuses on prefix sum (scan), one of the most imp. parallel patterns. Many algorithms that appear inherently sequential can be reformulated as scans, making them suitable for parallel execution.
The interesting part is that there isnt a single "best" scan algorithm. Different algorithm make different tradeoffs between parallelism and work efficiency. This chapter compares two classic approaches:
Summarizing chapter 10 of PMPP.
This chapter focuses on reduction, a pattern that derives a single value (sum, max, min etc) from an array of values. While reduction is conceptually simple, its parallel implementation can be surprisingly inefficient.
Programming Massively Parallel Processors (PMPP) has 22 chapters in total. If I do one chapter a day, it should take me about three weeks? 🤔
I want to do this.
Summarizing chapter 9 of PMPP.
This chapter introduces histogram computation which is different from most of the patterns that have been discussed so far. In matmul, conv or stencil computations, each output element is usually owned by a single thread.