LLMKube 0.8.0 shipped:
→ Foreman, an opt-in kube-native orchestrator for agent fleets across heterogeneous local LLM hardware
→ Coder + verifier + reviewer pipeline running on Apple Silicon + NVIDIA + Intel
→ Intel oneAPI / SYCL GPU support from a first-time community contributor (PR #557)
→ Foreman authored 2 of its own PRs into this release (#508, #588), signed-off-by Foreman Bot
https://t.co/sh4Z8jlCDB
Two $500 consumer GPUs. 32 GB total.
A 35B model just ran at its full 256K context on them. 512K with YaRN.
Standard f16 KV runs out of memory before it even reaches the native 256K.
TurboQuant KV on consumer Blackwell. Writeup 👇
https://t.co/n4a9xDTAUh
@BlockedPaths Exactly what I was going for! For folks who have homelabs, that's very much the case. I have seen businesses in the same spot too. I really believe there is a lot to offer around heterogeneous inference systems.
@Youssofal_ Strongly with you on this. A lot of what we ship on our local-model orchestrator is scaffolding fixes: schema strictness, scheduler timing, cmd deadlocks, port-stale on respawn. The model is rarely the bottleneck; the harness almost always is. Whole layer is underbuilt.
LLMKube 0.7.9 is out!
Your Mac is now a Kubernetes inference node running mlx-server, an OpenAI-compatible MLX runtime. Qwen3.6-35B on an M5 Max: 102.7 tok/s, 107ms TTFT.
Plus kubectl scale and autoscaling fixes.
https://t.co/ONWMQvkemS
@JadenHorst @0xgaut 100%! There’s no excuse for not seeing what’s coming. These providers will raise rates when investors want profit instead of growth, and people will be blindsided. Now is the time to get really serious about local inference.
LLMKube 0.7.8 ships ModelRouter Phase 1!
One OpenAI endpoint. Local + Anthropic/OpenAI/Bedrock/Vertex/LiteLLM. Fail-closed PII. Per-rule timeouts. Half-open circuit breaker. Audit log per request.
Local-first agentic, hybrid when you need it.
https://t.co/MUrsWlrcUo
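For anyone unfamiliar with the half-open circuit breaker mentioned above, here is a minimal sketch of the general pattern in Python. It is not LLMKube's implementation: the class name, `failure_threshold`, `reset_after`, and the injectable `clock` are all assumptions made for illustration. The idea is that repeated failures open the circuit so a backend gets skipped, and once a cool-down passes, a single probe request decides whether to close it again.

```python
import time


class CircuitBreaker:
    """Hypothetical three-state breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half-open"  # cool-down elapsed: allow a probe
        return "open"

    def call(self, fn):
        state = self.state
        if state == "open":
            # Fail fast instead of waiting on a backend known to be unhealthy.
            raise RuntimeError("circuit open: backend skipped")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures while closed, (re)opens it.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

This sketch is single-threaded and lets every caller through while half-open; a production router would typically limit that to one probe at a time.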
@ClementDelangue It's been amazing to see and be a part of. What was once something deemed impossible is now part of my daily flow locally. It just keeps getting better! Crazy how fast things are evolving too.
LLMKube 0.7.7 shipped:
→ vllm-swift on Apple Silicon w/ TurboQuant KV cache passthrough
→ OpenShift / OKD / MicroShift as a first-class deploy target
→ vLLM tuning fields (gpuMemoryUtilization, cpuOffloadGB) from a French community PR
→ Longhorn FSGroup fix from a user bug report with a full reproducer
https://t.co/TjUUwDwvD0
@MemoryReboot_ Fair if your model is one box doing everything. Different approach: Mac Studio as a node in a heterogeneous K8s cluster. llmkube's Metal Agent schedules across Apple Silicon, NVIDIA, AMD. Tool calling on unified memory, throughput on CUDA. Best of both worlds.
@Tuggernutz87 I've mostly been driving around town using hurry mode, but I've been really impressed. I already noticed on the freeway that it didn't want to camp in the left lane. That alone is a HUGE improvement!
Weekend shipping update: LLMKube 0.7.6.
Headline fix: the metal-agent will stop killing your only model when system RAM spikes. Priority eviction + friendly-fire guard + per-service opt-out.
Mutable modelRef. vLLM parallelSlots from a community PR.
https://t.co/fhmbL83sNv
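One way to picture priority eviction with a guard and a per-service opt-out is the rough Python sketch below. It reflects only one reading of the release note, not the metal-agent's actual logic: the `Service` fields, the "higher number means more important" convention, and the rule of never evicting the last running model are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    priority: int           # assumption: higher = more important
    memory_gb: float
    evictable: bool = True  # assumption: False models the per-service opt-out


def pick_victims(services, need_gb):
    """Choose services to evict, lowest priority first, until need_gb is freed.

    Opted-out services are never candidates, and at least one service is
    always left running (the assumed guard against killing the only model).
    """
    candidates = sorted(
        (s for s in services if s.evictable),
        key=lambda s: (s.priority, -s.memory_gb),  # low priority, big first
    )
    victims = []
    freed = 0.0
    remaining = len(services)
    for s in candidates:
        if freed >= need_gb or remaining <= 1:
            break
        victims.append(s)
        freed += s.memory_gb
        remaining -= 1
    return victims
```

With a single running model, `pick_victims` returns an empty list no matter how much memory is requested, which is the behavior the headline fix describes.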