I have been using browser harness by @browseruse
for the last 2 months, and slowly but surely it is getting better at my workflows.
I really like the pattern of having domain skills so that the next time the harness is used for the same or a similar task, it can be done faster and with fewer tokens.
I have heavily customised the harness with custom scripts.
The best use case I found is collab-use. I made this since I like using Colab for training and fine-tuning models for free on the TPUs. But waiting on the platform is a pain, as it doesn't have an API.
The collab skills started with a 300-step browser-harness workflow, and I have now reduced it to 70 steps for reliably using Colab for any general task.
Thanks for building this self-healing harness!! @mamagnus00 @sauravpanda
@bycloudai @ClementDelangue Definitely. I am trying to adapt this methodology to increase RL rollouts with individual NN blocks and see if it makes a difference.
A lot of things need to change tho, right from GRPO to infra and model alignment.
Recently when I was trying to increase the rollout throughput of my RL fine tuning pipeline, I noticed that the GPU stayed idle for long periods of time instead of actually serving the LLM.
After profiling, I realised there are several cold-start issues when you try to serve a model on vLLM (the inference engine).
The two largest contributors to cold start are vLLM inspect and torch.compile.
vLLM inspect - This inference engine supports a lot of different architectures and models - dense, MoE, multi-modal, speculative decoding, BF16, FP8, etc.
In order to create a reasonable execution plan, it has to inspect all the layers of the model it is running: layer shapes, attention heads, rotary embeddings, hidden dimensions, KV structure.
vLLM also precomputes KV cache sizes, the block allocation strategy, paged attention metadata, and batching scheduler limits.
For large models like gpt-oss-120b this becomes substantial.
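To make the KV cache sizing step concrete, here is a rough back-of-the-envelope sketch. It is not vLLM's actual code: the `KVConfig` class, the model dimensions, the 16-token block size and the 20 GiB memory budget are all assumptions picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class KVConfig:
    # Hypothetical model dimensions, chosen only for illustration.
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int = 2   # 2 bytes per element for BF16/FP16
    block_size: int = 16   # tokens per KV block (assumed value)


def kv_bytes_per_token(cfg: KVConfig) -> int:
    # Factor of 2 because both a key and a value are stored per layer and head.
    return 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.dtype_bytes


def num_kv_blocks(cfg: KVConfig, free_bytes: int) -> int:
    # How many fixed-size blocks fit in the memory left over for the cache.
    return free_bytes // (kv_bytes_per_token(cfg) * cfg.block_size)


cfg = KVConfig(num_layers=32, num_kv_heads=8, head_dim=128)
per_token = kv_bytes_per_token(cfg)          # 131072 bytes = 128 KiB per token
blocks = num_kv_blocks(cfg, 20 * 1024**3)    # 20 GiB budget -> 10240 blocks
print(per_token, blocks, blocks * cfg.block_size)
```

The real engine has to derive these numbers from the loaded model and from the GPU memory it measures as free, which is part of why this phase takes time at startup.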
Next is torch.compile - PyTorch compiles model graphs into optimized kernels. Most of the time it is pretty hard to beat these kernels on performance (although if you are good at GPU programming, you can).
But in order to generate these optimised kernels, torch takes substantial time, as it has to observe tensor shapes, control flow and operator patterns to generate stable graphs. These graphs are then used by the compiler to fuse matmuls, layernorm, activations and attention ops into fewer kernels. This is obviously expensive.
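One simple way to attribute cold-start time to phases like these is a wall-clock timer around each startup step. This is a minimal sketch, not the profiling setup actually used here: `phase`, `fake_model_inspection` and `fake_graph_compile` are hypothetical names, and the sleeps stand in for the real work.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def phase(name):
    # Record the wall-clock duration of the enclosed block under `name`.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


def fake_model_inspection():
    # Placeholder for the engine inspecting the model (assumption).
    time.sleep(0.01)


def fake_graph_compile():
    # Placeholder for graph capture and kernel compilation (assumption).
    time.sleep(0.1)


with phase("model_inspection"):
    fake_model_inspection()
with phase("graph_compile"):
    fake_graph_compile()

# Print phases from slowest to fastest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {seconds:.3f}s")
```

In a real run, the placeholders would be replaced by the actual engine startup calls, and the per-phase totals would show where the GPU is sitting idle.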
My next goal is to reduce this time as much as possible, perhaps with techniques like CUDA checkpointing and snapshots.
Will update with progress.
academics are unprepared for the coming world where much scientific progress is majorly a function of inference compute. whether OpenAI points the Eye of Stargate at your particular field will decide its acceleration. talent will leach away into the labs. it's already begun
we recently optimized qwen3.5-397b-a17b to be the fastest deployment publicly hosted.
and the crazy thing: we did it by writing CUSTOM KERNELS for AMD MI355x.
see our post below outlining how we optimized kernels to achieve SOTA performance.