π Launch today: Kog generates 3,000+ output tokens/s per single request, on standard datacenter GPUs.
We are bringing real-time LLM inference to hardware that companies already run in production.
The speed previously associated with purpose-built silicon is now delivered on NVIDIA H200 and AMD MI300X.
Today, we are opening our Tech Preview with a 2B coding model, with large frontier MoE support coming next.
Try our Playground β https://t.co/Rk9twi7Zdk
π₯ Why that matters, and how we did it β https://t.co/5WjgKGMuUS
π Monokernel deep dive β https://t.co/ivMYGMO18v
π Delayed Tensor Parallelism research β https://t.co/5yLMYQI2k8
read the thread π
@JonathanLeaders We'll find out soon (work in progress).
There are very good MoE models like GPT-OSS-120B or the latest Qwen that could be 1,000 - 3,000 tokens/s.
For the biggest SoTA MoE models (like DeepSeek V4) we are probably looking 500-1,000 tokens/s.
π Launch today: Kog generates 3,000+ output tokens/s per single request, on standard datacenter GPUs.
We are bringing real-time LLM inference to hardware that companies already run in production.
The speed previously associated with purpose-built silicon is now delivered on NVIDIA H200 and AMD MI300X.
Today, we are opening our Tech Preview with a 2B coding model, with large frontier MoE support coming next.
Try our Playground β https://t.co/Rk9twi7Zdk
π₯ Why that matters, and how we did it β https://t.co/5WjgKGMuUS
π Monokernel deep dive β https://t.co/ivMYGMO18v
π Delayed Tensor Parallelism research β https://t.co/5yLMYQI2k8
read the thread π
A great week in San Francisco for the Kog team at AMD AI DevDay 2026.
Presenting some of our latest work was a real pleasure, and the conversations around research, infrastructure, and product were especially valuable.
One signal came through very clearly.
Inference is moving much closer to the center of the conversation.
That gave us even more conviction that we are building in the right direction.
More to come soon.
I had to test it myself to believe this unreal inference speed.
3,000 tokens/s for 1 user on standard datacenter GPUs.
They leveraged a hidden efficiency gap in how GPUs generate tokens.
@Kog__AI just achieved 3,000 tokens/s on 8Γ AMD MI300X GPUs and 2,100 on 8Γ NVIDIA H200 (FP16, no speculative decoding). Their tech preview is on a 2B model, and they show how their techniques will scale to large frontier MoE models at similar speeds.
That's a huge number because normal low-batch GPU decoding for 2B to 8B models is usually closer to 100 to 300 tokens/s per request, so Kog is claiming something like a 10X to 30X jump in the speed one user actually feels.
Their trick: they are getting the speed by treating LLM decoding as a memory streaming problem, not mainly a math problem.
For 1 user at batch size 1, the GPU is not doing big, efficient matrix-matrix work like in training or large-batch serving; it is repeatedly pulling the modelβs active weights from high-bandwidth memory for each new token, so speed depends on how smoothly those weights keep flowing.
Normal inference stacks keep breaking that flow.
They run many separate GPU programs for different parts of the model, move intermediate results through memory, wait at synchronization points, talk back to the CPU for scheduling or sampling, and then repeat this token after token.
Kogβs answer is to co-design 3 things that are usually tuned separately: the runtime, the low-level GPU code, and the model architecture.
The biggest engineering move is the monokernel, where the whole decode pass runs as 1 persistent GPU-resident program, including sampling, so the system does not keep stopping for kernel launches, CPU scheduling, and intermediate memory round trips.
They also rebuilt synchronization, because their own measurements say grid sync was eating around 35% of token-generation time; instead of making every compute unit wait at a broad barrier, each unit waits only for the exact data it needs.
On AMD MI300X, they also map memory access around the chiplet layout, because memory latency changes depending on which die makes the request.
Then their Laneformer model uses Delayed Tensor Parallelism, which lets cross-GPU communication happen in the background instead of blocking every layer.
π Launch today: Kog generates 3,000+ output tokens/s per single request, on standard datacenter GPUs.
We are bringing real-time LLM inference to hardware that companies already run in production.
The speed previously associated with purpose-built silicon is now delivered on NVIDIA H200 and AMD MI300X.
Today, we are opening our Tech Preview with a 2B coding model, with large frontier MoE support coming next.
Try our Playground β https://t.co/Rk9twi7Zdk
π₯ Why that matters, and how we did it β https://t.co/5WjgKGMuUS
π Monokernel deep dive β https://t.co/ivMYGMO18v
π Delayed Tensor Parallelism research β https://t.co/5yLMYQI2k8
read the thread π
The monokernel idea was one of their powerful trick.
Instead of launching many small GPU programs for normalization, attention, feed-forward layers, sampling, and communication, Kog keeps the whole decode loop inside 1 long-running GPU program.
With a monokernel, weights for the next stage can start loading while the current stage is still finishing, so the GPU behaves more like a pipeline and less like a machine constantly being paused and restarted.
If a Transformer layer is broken into many small GPU programs, the system can burn a scary amount of its budget just stopping, starting, syncing, writing, reloading, and waiting, before doing useful token generation.
The monokernel tries to remove that stop-start behavior.
Once it begins, it stays resident on the GPU and handles the full sequence, including prefill, decode, sampling, tensor-parallel communication, reductions, and internal state, without going back to the CPU for every little step.
The big gain is that weight streaming stays continuous.
For batch-size-1 inference, the GPU mostly needs to stream active model weights from high-bandwidth memory into compute units as smoothly as possible.
Read more about their βmonokernelβ implementation here.
https://t.co/orFgJ29gX5
Today, we're releasing LFM2.5-8B-A1B, a device-optimized model designed to power real-life applications on phones, laptops, PCs, robots, and fast & lightweight server-side use-cases.
> 8B MoE, 1.5B active
> Expanded 128K context
> LFM2.5 flagship hybrid MoE architecture
> Trained on 38T tokens + large-scale RL
> fast, reliable tool calling, punching above its weight, comparable to models with up to 4x its size
> customizable on a single GPU for any specialized task
> LFM2 open-weight license
π§΅
We expect our thousands-token-on-GPU speed results to scale way past our 2B-parameter preview model.
Single-request decoding depends on active-parameter count per token, not total parameters, and current frontier MoE models only activate a fraction.
For instance, DeepSeek-V4-Flash has 284B total, 13B active. With its default FP4/FP8 quantization, the math checks out for 1,000 to 3,000 tokens/s on current datacenter GPUs.
We plan to support frontier MoE models in Kog Inference Engine in the coming months, on GPU hardware enterprises and sovereign-AI builders already own. No proprietary silicon needed.
π Try it β https://t.co/Rk9twi8x2S
π Take a deeper look at our claims β https://t.co/5bgxMKGKgK
π οΈ Build with it β https://t.co/bwRfpO7Q58 (building fast agents? We're taking design partners!)
We also introduce Delayed Tensor Parallelism (DTP) to minimize inter-GPU communication wait-time.
Standard tensor parallelism ends each module with a blocking all-reduce operation.
DTP is a Transformer architecture variant that makes those all-reduce operations asynchronous, hence reducing inference communication overhead to 0.
DTP matches the quality of standard TP, and edges out similar methods like Ladder Residual and PT-Transformer in our setting.
π Full post β https://t.co/5yLMYQI2k8
Learn how Kog AI utilizes creative GPU Engineering optimizations to build a real-time LLM inference engine on AMD Instinct GPUs.
This is the video recording of our talk at AMD AI DevDay 2026 in SF two weeks ago.
Ping me directly for the slides π
Feedbacks welcomed.
https://t.co/jS9Epm8TSr
π Inspired to be at @AMD AI Dev Day in San Francisco today where the AI community is converging to shape whatβs next.
The energy is exceptional. Every keynote, demo, and conversation reinforces a deeper truth: AI has moved beyond advancing computing. It is now redefining human potential and possibility at global scale. At AMD, weβre proud to be at the center of this transformation, accelerating it with open, powerful, and accessible technology.
A personal highlight was spending time with visionary founders and CEOs including @ComfyUI, Midjourney, @AnythingLLM, @Kog__AI, Starcluster and many more. These builders are turning bold ideas into real-world impact, and it was an honor to put Ryzen AI Max powered @HP ZBook mobile workstations in their hands so they can move even faster.
This is leadership in the AI era: empowering creators, fueling ambition, and turning visionary ideas into scalable reality.
The future isnβt coming. Itβs being built right here, right now. Excited to keep pushing the boundaries alongside this incredible community.