Lily Su @excelsiorpred - Twitter Profile

about 1 month ago

@zhzHNN Hey! I'd love to grab coffee or lunch at MLSys. I'd love to learn more about what you are working on and share conference notes. Let me know.

1

0

70

Lily Su @excelsiorpred

about 1 month ago

@TheNoise2Signal Hey! I'd love to grab coffee. I'm free Tuesday 11am–2:45pm, or Wed/Thu before 1pm or during lunch. What works for you?

0

18

Lily Su @excelsiorpred

about 1 month ago

@mat_jacob1002 Hey! I'd love to grab coffee. I'm free Tuesday 11am–2:45pm, or Wed/Thu before 1pm or during lunch. What works for you?

1

0

89

Lily Su @excelsiorpred

4 months ago

@larsencc In pattern 1 I'm assuming that defining what counts as a "dangerous operation" adds complexity. That's why you chose the extra hop to the control plane. Otherwise, if dangerous operations were well defined, a low % of traffic, and low risk, you wouldn't have opted to add latency to 100% of traffic.

1

0

160

Lily Su @excelsiorpred

6 months ago

@Ahmad_Al_Dahle I hope you sleep well at night when thousands of hosts and guests need help and want to talk to a real support staff to resolve an issue and get hit with a polite "no" from a bot.

0

51

Lily Su @excelsiorpred

6 months ago

@karpathy "Failure to claim the boost" is FOMO if stringing together the ai-enabled programmable layer does not yield expected performance. No one wants to fall into the pit of being labeled as someone who overengineered with ai and failed. Traditional methods are applauded as raw skill.

0

9

Lily Su @excelsiorpred

6 months ago

@akshay_pachaar I'd be interested in content on the KV cache, because I am seeing significant variation across the rank factorization and CPU memory offloading techniques published in research. Less interested in compression methods for constrained hardware.

0

1

0

13

Lily Su @excelsiorpred

6 months ago

@Anthony_Bonato Very few people, maybe fewer than your audience, value sculptures, architecture, or furniture.

0

14

Lily Su @excelsiorpred

6 months ago

3. Kernel Conflation. A flat CSV summary makes it impossible to tell whether the slow matrix multiplications come from the Attention mechanism, the MLP blocks, or the Vocabulary Projection. The fix: implement Hierarchical Instrumentation. Instrument the model with nested scopes.

0

2

0

283

Lily Su @excelsiorpred

6 months ago

Been MIA due to leaky nose and leaky abstractions. Here are 3 things I learned battling nsys profiling on Windows WSL2. tl;dr: just don't... 🧵

3

1

0

1

296

Lily Su @excelsiorpred

6 months ago

2. Installing torch for root via sudo pip install is a "dirty" fix that can break the system Python. Use the NVIDIA Container Toolkit instead. Running the profiler inside an official NVIDIA PyTorch Docker container often bypasses the weird user/root permission split.

0

234

Lily Su @excelsiorpred

6 months ago

1. If you are developing on a Windows machine with a GPU, the rule of thumb for hardware profiling is to always stay on the host OS. Use WSL2 for coding and compiling (the Linux environment is better for PyTorch development), but execute the specific profiling command from PowerShell.

0

110

Lily Su @excelsiorpred

6 months ago

View the segmented layout: each large cudaMalloc block split into active/inactive regions, with color-coded allocations packed in. You can zoom/pan and adjust detail level for dense runs. This tool makes it way easier to debug why memory creeps up across iterations or hits OOM.

0

54

Lily Su @excelsiorpred

6 months ago

Shifting from compute to memory now. Captured a memory snapshot during a forward/backward pass using torch.cuda.memory._record_memory_history() and dumped it to a pickle. Here's a quick screen recording of what it looks like in https://t.co/6dXTNV6Rbv 🧵

3

1

0

1

109
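A hedged sketch of the capture step described in the tweet above. `torch.cuda.memory._record_memory_history()` is the call named in the thread; `_dump_snapshot` and the `enabled=None` stop call are my assumption about how the dump was produced, and all three are underscore-prefixed PyTorch APIs that may change between releases. The helper name `capture_memory_snapshot`, the `step` callable, and the default path are hypothetical.

```python
import importlib.util


def capture_memory_snapshot(step, path="snapshot.pickle", max_entries=100_000):
    """Record CUDA allocator history while `step()` runs, then dump it to `path`.

    Returns False without doing anything when PyTorch or a CUDA device
    is not available, True once the snapshot has been written.
    """
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    if not torch.cuda.is_available():
        return False

    # Start recording allocation and free events, with Python stack traces.
    torch.cuda.memory._record_memory_history(max_entries=max_entries)
    try:
        step()  # e.g. one forward/backward pass
        torch.cuda.memory._dump_snapshot(path)
    finally:
        # Stop recording so later code does not pay the tracing overhead.
        torch.cuda.memory._record_memory_history(enabled=None)
    return True
```

Usage would look like `capture_memory_snapshot(lambda: model(x).sum().backward())`, with `model` and `x` defined elsewhere; the resulting pickle is the file loaded into the viewer linked in the tweet.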

Lily Su @excelsiorpred

6 months ago

Hovering over individual blocks shows exact sizes, addresses, and full Python stack traces back to the allocation site (super useful for spotting leaks or unexpected holds). Below that, the Allocator State History timeline lets you select specific events (alloc/free/OOM)

0

59

Lily Su @excelsiorpred

6 months ago

In the Active Memory Timeline (the main plot), you see allocated memory on the Y-axis over time/events on the X-axis—clear spikes as activations and temporaries build up in the forward pass, then drop during backward as intermediates are freed.

0

34

Lily Su @excelsiorpred

7 months ago

@DavidSHolz I really appreciate the magazine that you gave out. I felt like it transformed the idea of Midjourney to be a tool for artists to do more truly creative art. The pieces featured in the mag were provoking, well curated, and felt truly like an artist magazine.

0

2

0

39

Lily Su @excelsiorpred

7 months ago

import time, torch
for _ in range(5): model(x)  # warmup
torch.cuda.synchronize(); t0 = time.time()
for _ in range(10): model(x)  # measure
torch.cuda.synchronize()
print(f"{(time.time() - t0) / 10 * 1000:.2f} ms/it")

0

156

Lily Su @excelsiorpred

7 months ago

TIL: Never benchmark PyTorch with just time.time(). Here’s why warmup + synchronize() actually matter: when you run output = model(input), the CPU only queues the work on the GPU and immediately moves on to the next line of code, so the timer can stop before the GPU has finished. The Fix...🧵

5

1

0

191

Lily Su @excelsiorpred

7 months ago

Solution:
1. torch.cuda.synchronize() before and after the timed section
2. 3–10 warmup iterations
3. Then measure (and average many runs)

0

110

Lily Su @excelsiorpred

7 months ago

- Loading Kernels: The GPU has to load the specific CUDA functions (kernels) into its instruction cache.
- Ramping Clock Speeds: Modern GPUs sit at a low frequency to save power. When the first big matrix multiply hits, it takes a few milliseconds for the hardware to ramp up.

0

93
