@larsencc In pattern 1, I'm assuming that defining what counts as a "dangerous operation" adds complexity, and that's why you chose the extra hop to the control plane. Otherwise, if dangerous operations were well defined, a low % of traffic, and low risk, you wouldn't have opted to add latency to 100% of traffic.
@Ahmad_Al_Dahle I hope you sleep well at night when thousands of hosts and guests need help and want to talk to a real support staff to resolve an issue and get hit with a polite "no" from a bot.
@karpathy "Failure to claim the boost" is FOMO if stringing together the ai-enabled programmable layer does not yield expected performance.
No one wants to fall into the pit of being labeled as someone who overengineered with ai and failed. Traditional methods are applauded as raw skill.
@akshay_pachaar Would be interested in content on KV cache, because I'm seeing a lot of variation across published research in the rank factorization and CPU memory offloading techniques being tried. Less interested in compression methods for constrained hardware.
3. Kernel Conflation. A flat CSV summary makes it impossible to tell whether those slow matrix multiplications come from the Attention mechanism, the MLP blocks, or the Vocabulary Projection. The fix is Hierarchical Instrumentation: instrument the model with nested scopes.
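A minimal sketch of the nested-scope idea from point 3, using `torch.profiler.record_function`. The `TinyBlock` model, its sizes, and the scope names are made up for illustration. It profiles CPU only so it runs anywhere; on a GPU box you would presumably add `ProfilerActivity.CUDA` to `activities`.

```python
import math

import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function


class TinyBlock(nn.Module):
    """Hypothetical toy block (single-head attention + MLP + vocab head)."""

    def __init__(self, dim=64, vocab=1000):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, x):
        # Outer scope wraps the whole block; inner scopes label each part,
        # so matmuls can be attributed to the component that issued them.
        with record_function("block"):
            with record_function("block/attention"):
                q, k, v = self.qkv(x).chunk(3, dim=-1)
                scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
                x = x + torch.softmax(scores, dim=-1) @ v
            with record_function("block/mlp"):
                x = x + self.mlp(x)
        with record_function("vocab_projection"):
            return self.lm_head(x)


model = TinyBlock().eval()
x = torch.randn(2, 16, 64)  # (batch, sequence, dim), arbitrary sizes

with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

# Each scope shows up as its own row next to the aten:: ops it contains.
scope_names = {evt.key for evt in prof.key_averages()}
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))
```

With the scopes in place, the summary table has separate rows for `block/attention`, `block/mlp`, and `vocab_projection` instead of one undifferentiated pile of matmul kernels.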
2. Installing torch for root via sudo pip install is a "dirty" fix that can break the system Python.
Use NVIDIA Container Toolkit. Running the profiler inside an official NVIDIA PyTorch Docker container often bypasses the weird user/root permission split.
1. If you are developing on a Windows machine with a GPU, the rule of thumb for hardware profiling is to always stay on the host OS. Use WSL2 for coding and compiling (a Linux environment is better for PyTorch development), but execute the specific profiling command from PowerShell.
View the segmented layout: each large cudaMalloc block split into active/inactive regions, with color-coded allocations packed in. You can zoom/pan and adjust detail level for dense runs.
This tool makes it way easier to debug why memory creeps up across iterations or hits OOM.
Shifting from compute to memory now. Captured a memory snapshot during a forward/backward pass using torch.cuda.memory._record_memory_history() and dumped it to a pickle.
Here's a quick screen recording of what it looks like in https://t.co/6dXTNV6Rbv 🧵
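For reference, a rough sketch of how a snapshot like this can be captured. The underscore-prefixed `torch.cuda.memory` functions are private and may change between PyTorch versions. The helper name `capture_memory_snapshot`, the toy step, the `max_entries` value, and the output path are all assumptions for illustration. Without a CUDA device the helper does nothing.

```python
import torch


def capture_memory_snapshot(step_fn, path="snapshot.pickle", max_entries=100_000):
    """Record CUDA allocator history around step_fn and dump it to `path`.

    Returns True if a snapshot was written, False if CUDA is unavailable.
    """
    if not torch.cuda.is_available():
        return False
    # Start recording alloc/free events together with their stack traces.
    torch.cuda.memory._record_memory_history(max_entries=max_entries)
    try:
        step_fn()
        torch.cuda.synchronize()  # make sure the step has actually finished
        torch.cuda.memory._dump_snapshot(path)
    finally:
        # Always stop recording, even if the step raised (e.g. on OOM).
        torch.cuda.memory._record_memory_history(enabled=None)
    return True


def toy_step():
    """Hypothetical forward/backward pass, just to generate allocations."""
    model = torch.nn.Linear(1024, 1024).cuda()
    x = torch.randn(64, 1024, device="cuda")
    model(x).sum().backward()


wrote = capture_memory_snapshot(toy_step)
print("snapshot written" if wrote else "no CUDA device, nothing recorded")
```

The resulting pickle is the file you load into the viewer.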
Hovering over individual blocks shows exact sizes, addresses, and full Python stack traces back to the allocation site (super useful for spotting leaks or unexpected holds).
Below that, the Allocator State History timeline lets you select specific events (alloc/free/OOM).
In the Active Memory Timeline (the main plot), you see allocated memory on the Y-axis over time/events on the X-axis, with clear spikes as activations and temporaries build up in the forward pass, then drops during backward as intermediates are freed.
@DavidSHolz I really appreciate the magazine that you gave out. I felt like it transformed the idea of Midjourney to be a tool for artists to do more truly creative art. The pieces featured in the mag were provoking, well curated, and felt truly like an artist magazine.
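import time, torch
# model and x are assumed to already live on the GPU
for _ in range(5): model(x)  # warmup
torch.cuda.synchronize()  # drain queued warmup work before starting the clock
t0 = time.time()
for _ in range(10): model(x)  # measure
torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
print(f"{(time.time() - t0) / 10 * 1000:.2f} ms/it")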
TIL: Never benchmark PyTorch with just time.time()
Here's why warmup + synchronize() actually matter:
When you run output = model(input), the CPU dispatches the order to the GPU queue and immediately moves to the next line of code.
The Fix...🧵
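A small sketch of the async-dispatch effect described above: time the same loop once without waiting for the GPU and once with `torch.cuda.synchronize()`. The matrix sizes, the loop count, and the helper names `sync` and `timed_matmuls` are arbitrary choices for illustration. On a CPU-only machine there is no queue, so the two numbers should come out roughly equal.

```python
import time

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
size = 2048 if device == "cuda" else 256  # smaller on CPU to keep it quick
a = torch.randn(size, size, device=device)


def sync():
    """Block until queued GPU work is done (no-op on CPU)."""
    if device == "cuda":
        torch.cuda.synchronize()


def timed_matmuls(n, wait):
    """Return milliseconds to issue n matmuls, optionally waiting for them."""
    sync()  # start from an empty queue
    t0 = time.perf_counter()
    for _ in range(n):
        a @ a
    if wait:
        sync()  # include the time the GPU needs to finish the work
    return (time.perf_counter() - t0) * 1000


timed_matmuls(3, wait=True)  # warmup
launch_ms = timed_matmuls(20, wait=False)  # mostly measures kernel launches
total_ms = timed_matmuls(20, wait=True)  # measures the actual compute
print(f"without sync: {launch_ms:.2f} ms, with sync: {total_ms:.2f} ms")
```

On a GPU the unsynchronized number is typically much smaller, because it only captures how long the CPU took to queue the work.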
- Loading Kernels: The GPU has to load the specific CUDA functions (kernels) into its instruction cache.
- Ramping Clock Speeds: Modern GPUs sit at a low frequency to save power. When the first big matrix multiply hits, it takes a few milliseconds for the hardware to ramp up.
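To see the warmup effects listed above for yourself, one option is to time each call individually and compare the first (cold) call with the later ones. This is only a sketch: the toy `Linear` model, the tensor sizes, and the call count are assumptions, and how big the gap is depends on the hardware and driver state.

```python
import time

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)  # hypothetical stand-in model
x = torch.randn(64, 512, device=device)


def timed_call():
    """Time one synchronized forward pass, in milliseconds."""
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    with torch.no_grad():
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1000


per_call_ms = [timed_call() for _ in range(8)]
cold_ms = per_call_ms[0]  # pays for kernel loading, clock ramp-up, etc.
warm_ms = min(per_call_ms[1:])  # best steady-state call
print(f"first call: {cold_ms:.3f} ms, best warm call: {warm_ms:.3f} ms")
```

If the first call is noticeably slower than the rest, that gap is what the warmup loop keeps out of the measurement.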