Spinning up inference engines such as vLLM, SGLang and TensorRT-LLM used to take up to 30 minutes.
Today, we are introducing our elastic inference solution, which scales out engine replicas in mere *seconds*.
Say goodbye to idle GPUs.
@hamishivi @yinn_oscar @RulinShao @TengX6 @natolambert @HannaHajishirzi @waiorg Respect for actually shipping the sandbox layer! Your ceiling is container start: the Docker daemon caps concurrent creates (the 64 cap + janitor). Fork off a warm snapshot and that ceiling is gone: sub-ms, no daemon.
Built Collimate for this. Happy to share an open-instruct backend!