I just spent the last couple of days trying to derive the orbital mechanics for the Artemis II mission.
Wrote an article explaining everything you need to know (you'll realize that this is a HUGE leap for humanity):
https://t.co/mkILstlDFG
This video breaks my heart. It was taken at the beginning of the russian invasion in 2022. A Ukrainian boy was separated from his family at the border.
Russia is a terrorist state.
Kyiv endured 10h of this today, over 40 missiles and 500 drones. Nearly a third of the city without power and heating. All because tens of millions of Russians have nothing else to live for, nothing to aspire to, no plans or dreams but to kill and die for their fucking tsar.
Not the first time I've seen people spoiled by the safety of democracy casually argue that russian occupation would change nothing in their lives.
Under russian occupation, you don't vote; you disappear. Raped, persecuted, tortured, thrown into basement prisons, mobilized to fight russia's wars of conquest. Your children taken, your language banned.
Yes. Exactly the same life, ffs.
russian occupation is not peace.
Russia attacked a grocery store in Zaporizhia. Many civilians were wounded, including children.
Whenever Putin faces difficulty on the front line, he begins killing innocent people in Ukraine with particular brutality. His army retreated in Kupyansk. Now he's taking revenge on civilians again.
I’ve built a prototype installer for the Stelline development image that lets you try it locally with no extra infrastructure. It’s powered by Docker and Jupyter Notebook, and includes scripts to simulate an observatory I/Q streaming network.
Once installed, you can reuse the built-in Stelline operators (Transport, Beamformer, Correlator, etc.) or develop your own Holoscan operators and plug them into the pipeline.
The installer was originally built for the DGX Spark, but it should work on any machine with an NVIDIA GPU. A ConnectX card is recommended for networking operators, but not required if you don’t plan to run them.
Preview: https://t.co/4gqJXt5feY
New in-depth blog post time: "Inside NVIDIA GPUs: Anatomy of high performance matmul kernels". If you want to deeply understand how one writes state of the art matmul kernels in CUDA read along.
(Remember, matmul is the single most important operation that transformers execute during both training and inference. Most of NVIDIA compute is spent on it. Gaining 1% in efficiency translates to massive savings on the order of many nuclear reactors :P)
I, yet again, realized i underestimated the effort. 😅 Here is one more booklet (lol). 47 figures!
I covered:
* The fundamentals of the GPU architecture with an emphasis on the memory hierarchy, building mental models for GMEM, SMEM, and L1/L2, and then connecting them to the CUDA programming model. Along the way we also looked at the "speed of light," how it's bounded by power, with hardware reality leaking into our model.
* PTX/SASS, and how to steer the compiler into generating what we actually want (is that loop being unrolled, are we using vectorized loads like LDG.128, etc.). I've annotated one PTX/SASS example for a simple matmul kernel in excruciating detail. Even if you're new to compilers you should find this useful.
(i actually found various inefficiencies in both compilers - fun!)
* Many core concepts such as tile/wave quantization, occupancy, ILP (instruction-level parallelism), roofline model, etc. Also building intuition around fundamental equivalences: matmul as a sum of partial outer products (instead of a grid of dot products), why square tiles are the right shape for high arithmetic intensity, etc.
* The warp tiling method - which is near SOTA assuming you can't use tensor cores, TMA, async mem instructions, and bf16. Just maximizing GPU's performance using nothing but CUDA cores, registers and shared memory.
* Finally, we step into Hopper (H100): TMA, swizzling, tensor cores and the wgmma instruction, async load/store pipelines, scheduling policies like Hilbert curves, clusters with TMA multicast, faster PTX barriers, and more.
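To make two of the ideas above concrete, here is a minimal pure-Python sketch (not code from the blog post). It checks that a matmul computed as a grid of dot products matches one computed as a sum of outer products, and it estimates wave quantization for a tiled kernel. The function names, the 128x128 tile size, the SM count of 132, and the "one output tile per SM per wave" model are all illustrative assumptions.

```python
import math

def matmul_dot(A, B):
    """C[i][j] is the dot product of row i of A and column j of B."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]

def matmul_outer(A, B):
    """C is the sum over k of outer(column k of A, row k of B)."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for k in range(K):              # one rank-1 update per k
        for i in range(M):
            a_ik = A[i][k]
            for j in range(N):
                C[i][j] += a_ik * B[k][j]
    return C

def wave_efficiency(M, N, tile_m, tile_n, num_sms):
    """Assumed model: each SM processes one output tile per wave."""
    tiles = math.ceil(M / tile_m) * math.ceil(N / tile_n)
    waves = math.ceil(tiles / num_sms)
    # Fraction of SM slots that hold a real tile; the last wave may be partial.
    return tiles, waves, tiles / (waves * num_sms)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert matmul_dot(A, B) == matmul_outer(A, B)

# Hypothetical configuration: 4096x4096 output, 128x128 tiles, 132 SMs.
tiles, waves, eff = wave_efficiency(4096, 4096, 128, 128, 132)
print(tiles, waves, round(eff, 3))
```

Under these assumptions the 1024 tiles need 8 waves, and the last wave is only partly full, so some SM slots sit idle. That idle tail is the wave quantization effect.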
As always, lots of examples, lots of visuals. This is the first time i could look at the warp tiling kernel and be like "oh i get it completely". I just needed my mental image transformed into an actual image.
A few years ago I was really inspired by @Si_Boehm's excellent blog post on how matmul works, but I also found it had several errors, some unclear explanations, and it was quite outdated. Building on @pranjalssh's amazing work (who did a great job building sota kernels for H100) and my own research, this is the final result.
---
Again a huge thank you to @Hyperstackcloud (GPU cloud) for giving me an H100 (PCIe) node to run some of the experiments and analysis that i needed to write this up.
Also a big thank you to my friends Aroun (who did a very thorough review of the post; Aroun's doing cool GPU/AI stuff at Magic and was previously a GPU architect at Apple and Imagine, he's one of the best GPU people i know and we worked together on llm.c w/ @karpathy) and the amazing @marksaroufim! (PyTorch) for taking the time during the weekend when they didn't have to. :)
11 years ago, 🇺🇦 Olena Kulish & Volodymyr Alyokhin got executed mafia-style in #Donbas. “Guilty” of supplying 🇺🇦 soldiers with food.
He was in IT. She was a popular radio host & animal rights activist. Russians killed her 6 dogs too. Just for the fun of it.
#RussiaUkraineWar
Today, russia killed these children. They were 8, 12, and 17.
A russian missile hit the Martyniuk family home in a small town in central Ukraine last night.
Father Ihor and mother Olena are in the hospital. She’s in critical condition.
Ukrainian journalist Viktoriia Roshchyna was tortured to death in Russian captivity, The Guardian reports.
Her body was returned without eyes, brain, or larynx. Burn marks on her feet. Stab wounds. Broken rib. Signs of strangulation. 1/
14 people killed, including 6 children: Russia launched a ballistic strike on a residential area of Kryvyi Rih near a children's playground.
Those who talk about peace are showing their true intentions to the world. Russia wants nothing but war.