Sagnik @sagnikcodes - Twitter Profile

@drummatick The harness is model-agnostic rather than tied to a specific model, right? And the router just dispatches per user request, or in other situations like long-horizon tasks, splits the work between different models?

0

17

Sagnik

@sagnikcodes

about 15 hours ago

Learn Mandarin ASAP, guys (a requirement at an AI security lab)

0

2

0

45

Sagnik

@sagnikcodes

1 day ago

@JustaLilPika Happy Birthday 🎉

0

1

0

20

Sagnik

@sagnikcodes

1 day ago

@sumanthraman I think companies should start laying off these boomers, they just deserve it

1

2

0

230

Sagnik

@sagnikcodes

1 day ago

@Shreyas_Pandeyy @grok @grok? 🤧

1

0

32

Sagnik

@sagnikcodes

1 day ago

@Shreyas_Pandeyy @grok confirm it mf

1

0

1K

Sagnik

@sagnikcodes

1 day ago

@DevanshuXi Damn, didn't know the GPU-selling business could be that tough. I used to think it's the same as selling VMs.

0

729

Sagnik

@sagnikcodes

1 day ago

My girl just sent me this, am I cooked guys 😭😭

0

1

0

63

Sagnik

@sagnikcodes

1 day ago

@Kama_Kamilia This is so real

0

34

Sagnik

@sagnikcodes

2 days ago

I feel the memory layer will be solved by the big AI labs, not startups like mem0 and supermemory, and it's almost solved: each new upgrade from the frontier labs adds a better memory layer for agentic memory. Distributed training and inference keep evolving and there is no limit to that. Compute was the real moat, and it still is.

2

1

0

429

Sagnik

@sagnikcodes

3 days ago

Nice architecture, makes inference much faster. During the decode phase we generally load the whole KV block from HBM; here, in the MSA index branch, we first index the KVs and get the top-k, so in the search we can query only the relevant KVs instead of loading all of them into shared memory. I need a detailed tech blog/report, I don't understand a lot of things, so let's wait :)

Sagnik

@sagnikcodes

3 days ago

Waiting for M3 and the report, benchmarks look very good :)

1

0

134

0

1

0

90
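To make the top-k idea in the post above concrete, here is a rough sketch of one decode step that scores every cached entry with a cheap index, keeps the top-k, and attends over only those keys and values. This is a generic top-k sparse attention toy, not the MSA design (the post itself says the detailed report is not out yet). The names `topk_sparse_decode`, `full_decode`, `index_q`, and `index_keys` are all assumptions made up for illustration.

```python
import math

import torch


def topk_sparse_decode(q, K, V, index_q, index_keys, k):
    """One decode step that attends over only the top-k indexed KV entries.

    q: (d,) query, K: (n, d) cached keys, V: (n, d_v) cached values.
    index_q (d_idx,) and index_keys (n, d_idx) are hypothetical low-dim
    tensors used only for cheap relevance scoring.
    """
    # Cheap relevance score for every cached position.
    scores = index_keys @ index_q
    # Keep only the k most relevant positions.
    top = torch.topk(scores, k=min(k, K.shape[0])).indices
    # Gather just those keys and values instead of touching the whole cache.
    K_sel, V_sel = K[top], V[top]
    # Standard scaled dot-product attention over the selected subset.
    attn = torch.softmax(K_sel @ q / math.sqrt(q.shape[0]), dim=0)
    return attn @ V_sel


def full_decode(q, K, V):
    """Dense baseline: attend over every cached position."""
    attn = torch.softmax(K @ q / math.sqrt(q.shape[0]), dim=0)
    return attn @ V
```

With `k` equal to the cache length the sparse step matches the dense baseline; the saving only shows up when `k` is much smaller than `n`, since fewer keys and values have to be read.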

Sagnik

@sagnikcodes

3 days ago

https://t.co/5ExQXkD2w2

0

38

Sagnik

@sagnikcodes

3 days ago

Going to random forests in the evening is my new hobby 🐈

1

3

0

74

Sagnik

@sagnikcodes

3 days ago

@hemlocktree12 @Pseudo_Sid26 😓😓😓

0

16

Sagnik

@sagnikcodes

4 days ago

The last time I profiled something, it was a Java microservices app running in a k8s cluster, and it was not a good experience. k8s hides a lot of things when profiling with JFR: I was storing the recordings and loading them locally in JMC, so it was not real-time. I applied a stress test and recorded it to look for memory leaks via the heap graphs, GC behaviour and so on.

Sagnik

@sagnikcodes

4 days ago

The main idea was simple: profile what's happening inside, how each operation passes through cuBLAS and runs in our GPU kernels. E.g. we take y = x @ W + b (matmul + bias add)
> different args you can see in the code, like size, compile, warmup
> first try a small 64 x 64 matrix: almost all the time is CPU-side overhead because the matrix is so small
> try 4096 x 4096: now the actual GEMM kernel becomes visible and GPU computation dominates
> the first step is startup and the second is warm-up, so the main profiling starts from ProfilerStep#2
> so always warm up before you start, because the first run always includes kernel loading and cuBLAS setup
> the large gap before aten::matmul is basically setup overhead
> torch.compile makes it faster (dynamo lookup and all)
So before optimizing, always profile :D In the next post I will take a raw model and profile it??
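For anyone who wants to try the same experiment, here is a minimal sketch of profiling y = x @ W + b with `torch.profiler`. It is not the code from the post: the function name `profile_linear` and its `size`, `warmup`, `active`, and `use_compile` arguments are assumptions picked to mirror the args mentioned above. It also swaps the profiler schedule for a plain warm-up loop run before profiling starts, so the trace will not contain ProfilerStep markers.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_linear(size=64, warmup=2, active=3, use_compile=False):
    """Profile y = x @ W + b after a manual warm-up (hypothetical helper)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(size, size, device=device)
    W = torch.randn(size, size, device=device)
    b = torch.randn(size, device=device)

    def linear(x, W, b):
        return x @ W + b  # matmul + bias add

    fn = torch.compile(linear) if use_compile else linear

    # Warm up outside the profiler so kernel loading and cuBLAS setup
    # (and compilation, if enabled) do not pollute the measurement.
    for _ in range(warmup):
        fn(x, W, b)
    if device == "cuda":
        torch.cuda.synchronize()

    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities) as prof:
        for _ in range(active):
            y = fn(x, W, b)
        if device == "cuda":
            # Wait for queued kernels so they land inside the profile.
            torch.cuda.synchronize()
    return y, prof
```

Something like `_, prof = profile_linear(size=4096)` followed by `print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))` gives a per-op summary; on a GPU machine, comparing `size=64` against `size=4096` should show the shift from launch overhead to GEMM time described above.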