so SGLang supports platform plugins now
I wanted to get a feel for the per-token overhead in the streaming phase
For mocked model streaming 500 tok/s the I measured 0.3 ms overhead per token
https://t.co/ZZBRWGgjmF
What's not yet perfect is having to build sglang from sources to use platform plugins
Made a PoC today to understand how Mooncake works
I was able to run Qwen 0.6B on CPU with Mooncake
Nano vLLM inspired
2xPrefill 1xDecode nodes
TCP transport mechanism
https://t.co/x1A44VWzaP
Entrypoint pd_two_prefill_chat.py
What if you could cut inference costs in half without giving up production-scale performance?
That's what @Tenstorrent Galaxy Blackhole, now live on the Cirrascale AI Innovation Cloud, was built to deliver:
โ Approximately half the cost of leading GPU alternatives
โ Bare-metal access (no virtualization overhead)
โ 90% of HuggingFace models run as-is, no rewrites required
โ Latency-optimized for large-context LLM inference and video generation
๐ Request access: https://t.co/C5PscwsnqX
Introducing Infinite Studio โพ.
Last week, @tenstorrent x @prodia announced the fastest Wan 2.2 video generation in the world.
We built a demo to show what that speed unlocks: directing an infinite movie in real time.
Demo ๐
.@tenstorrent is launching Galaxy Blackhole servers and clusters for fast LLM inference (DeepSeek-671B at up to 350 t/s/u), among other applications - no disaggregation needed, according to @jimkxa:
https://t.co/G2yu85tXs2