@drummatick The harness is general, not specific to one model, right? And the router just dispatches based on the user request, or in other cases, like long-horizon tasks, splits the work between different models.
I feel the memory layer will be solved by the big AI labs, not by startups like mem0 and supermemory, and it's almost solved already: each new release from the frontier labs adds a better memory layer for agentic memory. Distributed training and inference keep evolving and there is no limit to that. Compute was the real moat, and it still is.
Nice architecture, makes inference much faster. During the decode phase we generally load the whole KV block from HBM; here, in the MSA index branch, we first index the KVs, so now we have a top-k.
Then in the search we can query only the relevant KVs instead of loading all of them into shared memory.
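To make the top-k idea concrete, here is a toy sketch in plain Python. It is not the actual MSA kernel: the scoring rule (a plain dot product against the keys) and the function name `topk_sparse_attention` are assumptions for illustration only. It scores every cached position, keeps the k best, and runs softmax attention over just those entries.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def topk_sparse_attention(q, keys, values, k):
    """Toy decode step: attend only to the k highest-scoring cached entries."""
    # "Index" step (assumed scoring rule): one cheap score per cached position.
    scores = [dot(q, key) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

    # "Search" step: only the selected keys/values are read from here on.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in top]
    m = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)

    dim = len(values[0])
    out = [sum(w / total * values[i][d] for w, i in zip(weights, top))
           for d in range(dim)]
    return top, out

if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [2.0, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.0, 2.0]]
    top, out = topk_sparse_attention(q, keys, values, k=2)
    print("selected positions:", top)
    print("output:", out)
```

The point of the sketch is only the memory-traffic shape: the scoring pass touches something small per position, and the expensive part reads k entries instead of the whole cache.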
I need a detailed tech blog/report. I don't understand a lot of things, so let's wait :)
The last time I profiled something, it was a Java microservices app running in a k8s cluster, and it was not a good experience. k8s hides a lot of things when profiling with JFR: I was storing the recordings and loading them locally in JMC, so it was not real-time. I ran a stress test and recorded it to look for memory leaks via the heap graphs, GC behaviour and so on.
The main idea was simple: profile what's happening inside, how each operation passes through cuBLAS and runs in the GPU kernels.
e.g.
we take this y = x @ W + b
(matmul + bias add)
> different args you can see in the code, like size, compile, warmup
> first try a small 64 x 64 matrix: almost all the time is CPU-side overhead (dispatch and launch), the matrix is too small for the GPU kernel to matter
> then try 4096 x 4096: now the actual GEMM kernel becomes visible and GPU computation dominates
> the first step is for startup and the 2nd for warmup, so the main profiling starts from ProfilerStep#2
> so always warm up before you start measuring, because the first run always includes kernel loading and cuBLAS setup
> the large gap before aten::matmul is basically setup overhead
> torch.compile can make it faster: Dynamo captures the graph once, and later calls reuse the compiled code, where ops like the bias add can get fused
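A minimal sketch of this kind of run, using `torch.profiler`. This is not the original script: the function name `profile_linear` and its arguments are made up for illustration. The schedule uses `wait=1, warmup=1`, so step 0 is skipped, step 1 is warmup, and recording starts at step 2, which matches the ProfilerStep#2 note above. It falls back to CPU if no GPU is available.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

def profile_linear(size=64, use_compile=False, steps=5):
    """Profile y = x @ W + b; returns the last output and the summary tables."""
    if steps < 3:
        raise ValueError("need at least 3 steps: 1 wait + 1 warmup + 1 active")

    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(size, size, device=device)
    W = torch.randn(size, size, device=device)
    b = torch.randn(size, device=device)

    def linear(x, W, b):
        return x @ W + b  # matmul + bias add

    fn = torch.compile(linear) if use_compile else linear

    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)

    tables = []

    def on_ready(prof):
        # Called once the active steps of a cycle have been recorded.
        tables.append(
            prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
        )

    # Step 0: wait (not traced), step 1: warmup (traced, discarded),
    # steps 2..steps-1: active (recorded).
    sched = schedule(wait=1, warmup=1, active=steps - 2)
    with profile(activities=activities, schedule=sched,
                 on_trace_ready=on_ready, record_shapes=True) as prof:
        for _ in range(steps):
            y = fn(x, W, b)
            if device == "cuda":
                torch.cuda.synchronize()  # make sure the kernels have finished
            prof.step()  # tell the profiler one step is done
    return y, tables

if __name__ == "__main__":
    for size in (64, 4096):
        _, tables = profile_linear(size=size)
        print(f"--- size = {size} ---")
        print(tables[0])
```

Running it with `size=64` and then `size=4096` should show the shift described in the notes; passing `use_compile=True` adds the torch.compile variant, whose first call includes compilation time, which is one more reason to keep the warmup steps.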
So before optimizing, always profile :D
In the next post I'll take a raw model and profile it??