@KrisPatel99 New innovations in inference (weight quantization, new KV compression, new attention variants) are all easy to implement in CUDA and difficult to do on TPU. They need to pay extra to maintain convertibility to TPU.
@bboczeng This is a boatload of bullshit. One person’s agent load may be bursty, but the GPU cluster handles requests from large numbers of people’s agents. All of the GPU requests queue up to form batches, and GB200s are perfectly suited to handle those.
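
A toy sketch of the aggregation point above, not a model of any real cluster or of GB200 scheduling: each simulated agent is idle most of the time and occasionally fires a burst, and we compare the burstiness (coefficient of variation) of one agent against the summed load of many. The names `simulate_load` and `coefficient_of_variation`, and every parameter value (`burst_prob`, `burst_size`, agent counts), are made-up assumptions for illustration.

```python
import random
import statistics


def simulate_load(num_agents, steps=2000, burst_prob=0.05, burst_size=20, seed=0):
    """Total requests per time step from num_agents independent bursty agents.

    Hypothetical model: at each step, every agent independently fires a
    burst of burst_size requests with probability burst_prob, else nothing.
    """
    rng = random.Random(seed)
    return [
        sum(burst_size for _ in range(num_agents) if rng.random() < burst_prob)
        for _ in range(steps)
    ]


def coefficient_of_variation(series):
    """Standard deviation divided by mean: a unitless measure of burstiness."""
    return statistics.pstdev(series) / statistics.mean(series)


# One agent on its own versus the combined load of many agents.
cv_one = coefficient_of_variation(simulate_load(num_agents=1))
cv_many = coefficient_of_variation(simulate_load(num_agents=1000))

print(f"1 agent:     CV = {cv_one:.2f}")
print(f"1000 agents: CV = {cv_many:.2f}")
```

Under these assumptions the combined load is far smoother than any single agent's, which is the condition under which a shared queue can keep forming full batches. Real agent traffic may be correlated (for example by time of day), which this sketch ignores.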