thom✨ @gpuwaster - Twitter Profile

gpuwaster retweeted

bayes

@bayeslord

2 days ago

They didn’t mean pause AI research, they meant pause *your* AI research

thom✨

@gpuwaster

3 days ago

46/100 of GPU Grind

starting to work on an fp16 gemm kernel, playing with the __half api for now. it's all intrinsics, it feels like i'm writing avx512 but in a cuda program. i'm setting up all the reference computations etc, and i was surprised to see the difference between fp64 flops and fp16 flops (i run it on a 2060 for now, going to run it on ampere ultimately to be able to use more features).

like the fp64 to fp16 ratio for cuBLAS gemms is 1/36, which is not even that much considering the hardware peak of fp64 is 1/32 of fp32 which is 1/2 of fp16, it's just that i forgot the chip had that few fp64 cores. the way they say it in the whitepaper is literally "we just included bare minimum fp64 cores so that fp64 programs can run correctly". i knew that at some point but i forgot and was still surprised 😅
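(spelling out the arithmetic from the tweet above, using only the ratios it quotes: fp64 peak at 1/32 of fp32, fp32 peak at 1/2 of fp16, and the measured cuBLAS gemm ratio of 1/36. the subscripts are just labels chosen here for illustration.)

```latex
\frac{\mathrm{FP64}_{\text{peak}}}{\mathrm{FP16}_{\text{peak}}}
  = \frac{\mathrm{FP64}_{\text{peak}}}{\mathrm{FP32}_{\text{peak}}}
    \cdot \frac{\mathrm{FP32}_{\text{peak}}}{\mathrm{FP16}_{\text{peak}}}
  = \frac{1}{32} \cdot \frac{1}{2}
  = \frac{1}{64},
\qquad
\frac{\text{measured ratio}}{\text{peak ratio}}
  = \frac{1/36}{1/64}
  = \frac{64}{36}
  \approx 1.78
```

so going by those quoted numbers alone, the measured fp64 gemm sits roughly 1.8x closer to the fp16 one than the peak figures would suggest.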

thom✨

@gpuwaster

4 days ago

45/100 of GPU Grind

reading more about quantization today, different datatypes etc. the idea of going from fp32 to fp16 then to fp8 or even fp4 is quite simple, it's gonna be interesting to see how it's implemented though, since these are not datatypes that are necessarily natively supported in c/c++

i didn't know fp8 was only supported since hopper. i think i'm going to work on an fp16 ampere hgemm first to play with the fp16 api, and then maybe an fp8 one on hopper, which would also allow me to explore all the new hopper features
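(to make the "not natively supported in c/c++" point concrete, here is a minimal host-side sketch of decoding an fp16 bit pattern by hand in plain C++. it assumes the IEEE 754 binary16 layout: 1 sign bit, 5 exponent bits with bias 15, 10 mantissa bits. `half_bits_to_float` is a hypothetical helper name made up for this illustration, not part of the `__half` api, and it is not how any particular library implements the conversion.)

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Decode a raw IEEE 754 binary16 bit pattern into a float.
// Hypothetical helper for illustration only.
float half_bits_to_float(std::uint16_t h) {
    const int sign = (h >> 15) & 0x1;   // 1 sign bit
    const int exp  = (h >> 10) & 0x1F;  // 5 exponent bits, bias 15
    const int mant = h & 0x3FF;         // 10 mantissa bits

    float value;
    if (exp == 0) {
        // Zero or subnormal: mant * 2^-10 * 2^-14 = mant * 2^-24.
        value = std::ldexp(static_cast<float>(mant), -24);
    } else if (exp == 31) {
        // All-ones exponent: infinity if the mantissa is zero, NaN otherwise.
        value = (mant == 0) ? std::numeric_limits<float>::infinity()
                            : std::numeric_limits<float>::quiet_NaN();
    } else {
        // Normal: (1 + mant/1024) * 2^(exp-15) = (1024 + mant) * 2^(exp-25).
        value = std::ldexp(static_cast<float>(1024 + mant), exp - 25);
    }
    return sign ? -value : value;
}
```

for example, `half_bits_to_float(0x3C00)` gives 1.0f and `half_bits_to_float(0x7BFF)` gives 65504.0f, the largest finite fp16 value. in actual cuda code the conversions in `cuda_fp16.h` (such as `__half2float`) would be the thing to use instead of a hand-rolled decoder like this.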

thom✨

@gpuwaster

3 days ago

why no curandGenerateUniformHalf though 😢

thom✨

@gpuwaster

4 days ago

@filodesotano @ChShersh @Temporary111x ok i get it, super interesting thanks !

thom✨

@gpuwaster

4 days ago

@filodesotano @ChShersh @Temporary111x hmm ok but why does the main function being empty allow the execution to fall into unreachable() ?

thom✨

@gpuwaster

5 days ago

43-44/100 of GPU Grind

been working on a routines library, kinda BLAS-like except that i just put in the routines i'm interested in, which means some that are not part of BLAS, and i don't necessarily implement all the transpose variants etc. it's mostly just to group most of my kernel work in a single project. it's not directly about kernels but it's getting closer to something production-ready than just a bunch of separate projects with duplicated code, and i'm making progress in cpp! also it takes time now but it should allow me to be faster the next time i want to work on a new routine

gpuwaster retweeted

terminally onλine εngineer

@tekbog

4 days ago

they gonna make jensen an idol soon and put him in a boy band

thom✨

@gpuwaster

4 days ago

@ChShersh @Temporary111x can you explain why?

gpuwaster retweeted

International Cyber Digest

@IntCyberDigest

5 days ago

Same guy btw

thom✨

@gpuwaster

8 days ago

42/100 of GPU Grind

ran into some code with cuda graphs today and as i wasn't familiar enough with them i looked for some resources and found a lecture given for an oak ridge training series, by nvidia employees

straight to the point, explaining clearly the point of using cuda graphs to reduce cpu launch overhead, and how to create one using either stream capture or by describing it manually (or mixing both!)

thom✨

@gpuwaster

11 days ago

41/100 of GPU Grind

yesterday i watched GPU mode's lecture on PTX/SASS from the Gestwell founders: some insights on ptx and sass behaviors, and how to read them. i think one of the speakers even said he looked more at the generated ptx than at the profiler when writing a kernel. found that surprising at first but i guess when you have enough expertise with ptx that makes sense, you directly understand how things are going to be from the ptx.

they presented a tool they created to analyze generated PTX and compare it to what you would expect for a given algorithm for a given compute capability. it flags unexpected behaviors and you can review them manually. i thought it didn't make much sense on their first example, failed to see how you couldn't have drawn the same conclusion from looking at the cpp source code, but for larger libraries, to analyze a lot of compiled kernels at once, it looked super cool! they ran it on cuBLAS (3553 kernels), and got 41k signals organized by priority etc. i guess such a tool would be useful for cuBLAS developers for example (if the signals are actually interesting)

thom✨

@gpuwaster

11 days ago

@MainzOnX which use cases did people give you? i've heard of people using fp128 but never fp256

thom✨

@gpuwaster

11 days ago

the duality of a man (getting carried)

thom✨

@gpuwaster

12 days ago

40/100 of GPU Grind

worked a little on a routines library i'm making, and after implementing the benchmark part and profiling the application with nsight systems i realized most of the time the application ran was spent in the vector initializations (i reset them to random numbers between each run). this also allowed me to discover nvtx to instrument my code and find everything in the profiler. i don't time this initialization part so it doesn't matter for the result in itself, but it's still time i spend waiting in front of the screen, and modal credits i'm burning for no reason.

so i replaced my manual method with calls to cuRAND, and it cut the total execution time of the benchmark by like x20 !! it's much better, now the main "bottleneck" regarding the sizes of problems i can benchmark is the kernel itself or the available memory (i still need to allocate them on cpu for result checking etc)

gpuwaster retweeted

Paul Kuruvilla

@RohitPaulK

12 days ago

@BullTheoryio

thom✨

@gpuwaster

15 days ago

39/100 of GPU Grind

continuing the cs149 parallel programming course, watching lecture 2! it's about multi-core processors, SIMD concepts and examples with avx intrinsics, cache hierarchy etc

it's quite interesting to see how there are bridges to how a gpu works everywhere in those explanations, for example the simd divergence issues / warp divergence issues
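(a toy model of the divergence issue mentioned above, in plain C++ with no intrinsics. it assumes an 8-lane vector unit that handles a branch by running each branch body over the whole vector under a mask, which is the usual way both simd divergence and warp divergence get explained. `kLanes`, `MaskedResult` and `run_branch_masked` are names made up for this sketch, and the even/odd branch is an arbitrary example.)

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 8;  // assumed vector width for this toy model

struct MaskedResult {
    std::array<int, kLanes> out{};
    int passes = 0;  // how many branch bodies the whole vector stepped through
};

// Models `if (x % 2 == 0) y = x * 2; else y = x + 1;` executed in lockstep.
MaskedResult run_branch_masked(const std::array<int, kLanes>& in) {
    std::array<bool, kLanes> take_even{};
    bool any_even = false;
    bool any_odd = false;
    for (std::size_t i = 0; i < kLanes; ++i) {
        take_even[i] = (in[i] % 2 == 0);
        any_even = any_even || take_even[i];
        any_odd = any_odd || !take_even[i];
    }

    MaskedResult r;
    if (any_even) {  // "then" body: every lane steps, only masked-in lanes write
        ++r.passes;
        for (std::size_t i = 0; i < kLanes; ++i) {
            if (take_even[i]) r.out[i] = in[i] * 2;
        }
    }
    if (any_odd) {   // "else" body: same again with the inverted mask
        ++r.passes;
        for (std::size_t i = 0; i < kLanes; ++i) {
            if (!take_even[i]) r.out[i] = in[i] + 1;
        }
    }
    return r;
}
```

with an input where every lane agrees (all even, say) `passes` is 1, and as soon as the lanes disagree it becomes 2: the vector pays for both bodies even though each lane only keeps one result.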

thom✨

@gpuwaster

12 days ago

paris boutta blow up

thom✨

@gpuwaster

12 days ago

@shreyansj yeah i think theyre done starting up

thom✨

@gpuwaster

17 days ago

38/100 of GPU Grind

i've seen screenshots of it on the tl multiple times, and it got my interest so i'm starting course cs149 on parallel computing from stanford, the lectures are available on youtube. i expect to already know most of the concepts from there since i just did a semester of high performance computing a few months ago, but i think it'll still be interesting to have some refreshers when i'm tired, and also see if stanford professors address the topic differently

first lecture was really an introduction, making students perform parallel computing things (like counting the number of students in class) with different methods, introducing moore's law etc. quite funny how most of the figures were the exact same ones we saw in class, i guess every professor uses those

thom✨

@gpuwaster

19 days ago

37/100 of GPU Grind

i spent a lot of time setting up my development environment for CUB, and while my tests were building (multiple hours actually 🫠 ) i watched the gpu mode lecture on consumer gpus from Jake Cannell (vast ai), which was actually a lot of background on pre-cuda gpu programming in the graphics ecosystem of the early 2000s and its evolution

to stick with the topic of consumer gpu performance, the benchmarks in the talk show that you get more flops per dollar with a 4090 than with an H100. that might be common knowledge but i was actually quite surprised. however the h100 obviously gets more flops, and also has more memory
