Launching Cacheon: an open, incentivized competition for LLM inference optimization.
As model quality converges, the next frontier is serving them economically at scale: lower latency, higher throughput, and lower cost per token.
Cacheon turns that problem into a live arena with continuous evaluation. Developers submit containerized inference servers, benchmarked on standardized hardware against a pinned vLLM baseline. The fastest server that preserves output correctness wins.
The goal is to make better inference systems discoverable, measurable, deployable, and rewarded in the open.
Mainnet launches by May 19. Learn more: https://t.co/JPbyJpLszq
Cacheon competition restarts June 1. What we overhauled this week:
- One-pass eval: speed + correctness on the same prompts, same outputs
- Single metric: end-to-end wall time vs baseline (no TTFT / TPS split)
- Improved logging and telemetry
- Emissions ramp up after June 8
- Conviction lock soon
Inference is the compute layer everything runs on. Open competition is how we surface the best. Miners, show us what you've got. 💪
Read more: https://t.co/xZE4ZtjTVW
We shipped two things over the weekend: a 0.1 TAO miner submission fee and @shadeformai GPU support.
Submission fee: Every on-chain commit now costs 0.1 TAO. Goal is to cut spam and add skin in the game. Fee covers GPU rental first; anything left buys SN14 tokens and burns it. Miner workflow is unchanged. Docs: https://t.co/IfKldbyBTh
Shadeform GPUs: Validator can now pull GPUs from Shadeform alongside @TargonCompute and @lium_io
for evaluations. More supply, less wait time.
Updates on evaluation upgrade coming later this week. Follow our Bittensor Discord channel for ongoing discussions.
Day 1 of @cacheon_ai is in the books. Stressful? 100%
We shipped, broke things, and found multiple exploits. All in one day.
That is how it is supposed to go. We are building the infrastructure that makes inference competitive with centralized providers. That is not a small problem. You do not find the real edges until real miners are running real models on a live subnet. We found them fast and want to patch them even faster.
Day 1 stress means people care enough to try to break the system. That is what makes this product stronger.
Evaluation and emission are paused until the fixes are in. More updates before end of week.
To every miner who stayed in and flagged issues: thank you!
Didn't expect 40+ miners on testnet.
People ran real inference servers for nothing but early positioning. That kind of participation before any reward is on the table tells you something.
To every testnet miner: you built this community before it had a dollar attached. That's why I'm confident @cacheon_ai mainnet will work!
Cacheon mainnet is live.
13 inference servers queued, each racing to beat our baseline on a dedicated 8x H200 pod.
The winner earns up to $10,000/day. Inference optimization starts today on @Bittensor.
Follow along: https://t.co/u52cmcD3Hb
Back-to-back kings crowned in a single eval run on Cacheon testnet. UID 11 took the crown, then UID 20 snatched it minutes later.
Scores are still very small since miners are mostly tuning vLLM defaults. The real jump comes when someone ships actual KV cache reuse, better prefill optimizations, or speculative decoding (V2). That's what this subnet is built for. Keep pushing. 🚀
Early dashboard live at https://t.co/yt4ABEpIS8. Still a very early version with some rough edges. The site is also open source at https://t.co/oIFU8VolaB and we welcome contributions.
The OpenReview thread on TurboQuant vs RaBitQ is worth reading:
– prior work (RaBitQ) not properly addressed, missing apples-to-apples comparison
– RaBitQ author flagged it; still not clearly resolved
– disputed benchmark setup: RaBitQ on Python single-thread CPU vs TurboQuant on H100
Makes you question how much of the “KV cache gains” are actually real. This optimization space needs stricter evals.
Had an awesome conversation with @SubnetSummerTAO about what we’re building.
Perhaps @xavi3rlu was a bit too pumped about it, but the core idea is simple: make LLM serving faster, cheaper, objectively benchmarked, and production-deployable.
🔥 Subnet Summer AMA X @cacheon_ai (Subnet 14) 🔥
@xavi3rlu is building a decentralised inference competition network for open-source AI models, Cacheon is Subnet 14 on Bittensor, creating a permissionless benchmarking system to power the next generation of fast, accurate, and trustless AI inference infrastructure. In this episode, we sit down with the team behind Cacheon, a decentralised inference performance subnet built on Bittensor.
We cover:
- What Cacheon is building and why containerised inference competition matters for the future of open-source AI
- How miners compete by submitting Docker-packaged inference servers optimised for speed and correctness when serving open-source models
- Decentralised validation: how validators benchmark and score miner submissions in real time to ensure outputs meet quality and performance standards
- Cacheon vs centralised inference providers and why the future of model serving should be open, permissionless, and economically incentivised
- The role of token incentives in driving continuous performance improvements and attracting world-class inference engineers to the network
- How Cacheon is pushing the boundaries of what decentralised compute can deliver for AI applications at scale
- Early progress, current network stats, and what's coming next
- Roadmap toward becoming the go-to decentralised inference layer for open-source model deployment
- Live community Q&A
If you're interested in decentralised AI, open-source model serving, GPU compute, or the future of inference infrastructure - this one's for you.
https://t.co/6bHkK0XNbs
Their container exited at startup: --kv-cache-dtype fp8 is not a valid choice in that SGLang build. Use fp8_e4m3 or fp8_e5m2 instead. Also align --served-model-name and --model-path with the contract, and avoid hardcoding tensor parallel size (pod GPUs vary).
Hope to see a resubmit.
https://t.co/zhEWltSLjg
The first SGLang-based miner hit testnet (5GbtnKr4, UID 15)! This is exactly what Cacheon is designed for: bring any language or framework you want, as long as you match the OpenAI-style API we document (/health, streaming chat completions, logprobs, etc.).
🚨 The most overlooked problem in AI right now, and how $TAO's SN14 @cacheon_ai just turned it into a competition.
Everyone is racing to build bigger models. The real meltdown is in serving them.
We are obsessed over model training. Who has the biggest model. Who has the most parameters. Who scored highest on the latest benchmark. That's the Formula 1 race car.
But there is something else once you build the car. Have to drive it, service it, and have a race strategy.
That's inference. It's where AI actually meets reality. That’s the Real Problem
Every time you ask ChatGPT a question. Every time Claude responds. Every time an AI agent acts. There's a machine somewhere doing the work of generating that answer. That machine is slow, expensive, and largely invisible to the user until it isn't.
When you wait 8 seconds for a response. When an API call costs 10x what it should. When an enterprise AI workflow grinds to a halt under load. That's inference failing.
Anthropic just secured massive compute capacity from SpaceXAI's Colossus 1 data center just to keep Claude running. Even the most advanced labs in the world are fighting for the infrastructure to serve their own models. That's how broken this layer is and what many miss.
SN14 is open competition of that peoblem
Cacheon picks one fixed open source model Qwen2.5-72B-Instruct and asks one question:
Who can serve it faster?
Miners build their own inference servers. Any language. Any framework. Custom CUDA kernels. FlashAttention. PagedAttention. Whatever optimization they can dream up. They package it in a Docker container and submit it on-chain
Validators pull every submission and run it on identical hardware against a vLLM baseline
They measure two things:
• Time-to-first-token (how fast the first word arrives)
• Throughput (how many tokens per second the system can produce).
But here's the genius of the design fast is not enough. Correct wins.
If your server is 3x faster but generates wrong outputs, you score zero. The correctness gate runs first. Only then does speed matter. This stops the obvious gaming where someone cuts corners on quality to win on speed.
Fastest correct server becomes the King. Takes 100% of emissions up to 33 $TAO per day until someone beats them. Mainnet launches May 19.
The AI industry has converged. GPT, Claude, Gemini, Grok, the quality gap is closing. What separates products now isn't raw intelligence. It's the experience of using them. Speed. Cost. Reliability. The pit crew, not the race car.
A model that responds in 800ms feels alive. The same model at 4 seconds feels broken. The difference between a viable AI agent and an unusable one is often just inference performance.
This work happens behind closed doors. OpenAI optimizes their stack privately. Anthropic optimizes theirs privately. Google does the same. None of those optimizations ever reach the open-source models the rest of the world actually uses.
Every technique public improvements measurable. The best one wins, gets paid, and becomes the new standard.
Team is legit @xavi3rlu (ex-Opentensor), Clément Blaise, Dera Okeke, with @KibibyteMe advising. First testnet already ran: miners submitted, failed startup requirements exactly as designed.
Roadmap looks good:
▫️V1: Beat vLLM on one model
▫️V2: Speculative decoding, quantization, concurrency
▫️V3: Winning servers become real production endpoints with actual traffic and revenue
▫️V4: Multi-model + OpenRouter integration
Take the layer centralized AI handles worst (inference), open it up, and let the market discover the best solution through competition.
Anthropic just paid SpaceX for inference capacity. Cacheon is building the version where the best optimizations rise continuously and stay open.
While intelligence is getting cheaper, deploying it is becoming more expensive. SN14 is infrastructure.
$TAO
DYOR
🔗
https://t.co/29wNoHg13m
https://t.co/mbLUwUp3iJ
AMA with @SubnetSummerTAO this Thursday.
We’ll talk roadmap, how we plan to capture the market for faster LLM serving, and why an open inference arena can become valuable infrastructure.
Come hang out, ask questions, and hear what’s next for SN14.
🚨Subnet Summer AMA X @cacheon_ai (SN14) 🚨
🕐| 6:30 PM GMT (Thursday, May 14)
Join us as we sit down with the team behind Cacheon (Subnet 14) on Bittensor to explore how they're building a decentralised inference competition network for open-source AI models.
Cacheon is developing a containerised inference benchmarking system, where miners submit Docker-packaged inference servers and validators score them on speed and correctness when serving open-source models. Instead of relying on centralised inference providers, Cacheon introduces a continuously evolving competitive environment where performance is economically incentivised and optimised over time.
At its core, Cacheon represents a shift from centralised AI inference → decentralised, adversarial benchmarking. By leveraging Bittensor's incentive design, the network creates a feedback loop where faster and more accurate inference servers rise to the top, ultimately driving the best open-source model serving infrastructure on the planet.
This AMA is your chance to explore how Cacheon is turning inference performance into a decentralised, market-driven competition.
We'll cover:
- What Cacheon (SN14) is building
- How miners submit and optimise containerised inference servers
- How validators score speed and correctness
- Why open-source model serving matters for decentralised AI
- Token incentives behind high-performance inference
- Real-world use cases and applications
- Early progress and roadmap
- Live Q&A with the team
Cacheon is pushing Bittensor into one of the most fundamental layers of AI infrastructure inference at scale.
Set your reminder 🔔