- DeepSeek V4 Flash - Native Precision (FP4 + FP8)
- Fits on 2x RTX Pro 6000 GPUs + 256 GB DDR5 RAM
- Using KTransformers: the KVCache-AI fork of SGLang for hybrid GPU/CPU memory inference
I have something of an obsession with running applications on resource-constrained systems and squeezing out the maximum performance possible. Part of that comes from a past life as a systems engineer, building & upgrading nationwide (USA) Video-On-Demand streaming backends while navigating headless *nix servers around the time "cloud" was becoming a buzzword.
KTransformers gets little mention across the LLM inference-sphere, despite being among the engines listed for many of the popular models on HuggingFace (alongside vLLM, SGLang, & llama.cpp). The KVCache-AI team is best known for its SGLang fork for hybrid GPU / CPU memory inference, which benefits MoE models. I expect these hybrid setups to gain popularity, especially on the consumer side, as hardware prices continue to soar.
"Necessity is the mother of invention" as they say, and local AI runners will continue finding more creative ways to run intelligence, whether that involves GPU/CPU memory offload, distributed training / inference, model weight / KV Cache quants, or REAPs.
Here I have DeepSeek V4 Flash running at a 1M context length on 2x RTX Pro 6000 GPUs, using its native mixed precision of FP4 + FP8. KTransformers lets you reduce GPU memory usage by keeping only a subset of each MoE layer's experts in GPU VRAM, with the remaining experts held in system RAM. It can also update GPU expert placement during inference, using routing statistics collected during the prefill phase. There's a lot of trial and error involved, given the limited kernel support for RTX Pro 6000s.
Two of the prompt load stress-test benchmarks I like to run come from the local-inference-lab/llm-inference-bench GitHub repo & the AlienKevin/SWE-ZERO-12M-trajectories HuggingFace dataset.
Here are the main optimized KTransformers SGLang flags:
- Context Length: 1048576
- Total Number of Tokens: 1048576
- Chunked Prefill Size: 16384
- Max Prefill Tokens: 16384
- GPU Prefill Token Threshold: 1024
- GPU Memory Utilization: 87%
- Number of Experts per MoE Layer on GPU: 134 / 256
- Max Running Requests: 256
- CUDA Graph Max Batch Size: 256
- CUDA Graph Batch Sizes: 1 2 4 8 16 32 64 128 256
- Available GPU Memory: 20.81GB (anything less was too tight for agentic coding)
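A few derived numbers fall out of the settings above. This is only a minimal sketch: the field names are hypothetical labels for the values in this post, not actual KTransformers or SGLang flag names, and the chunk count assumes a single request prefilled in fixed-size chunks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridSettings:
    # Hypothetical field names; the values are copied from the list above.
    context_length: int = 1_048_576
    chunked_prefill_size: int = 16_384
    gpu_experts_per_layer: int = 134
    total_experts_per_layer: int = 256

    def gpu_expert_fraction(self) -> float:
        """Share of each MoE layer's experts resident in GPU VRAM."""
        return self.gpu_experts_per_layer / self.total_experts_per_layer

    def cpu_experts_per_layer(self) -> int:
        """Experts per MoE layer left in system RAM."""
        return self.total_experts_per_layer - self.gpu_experts_per_layer

    def prefill_chunks_for_full_context(self) -> int:
        """Chunks needed to prefill one full-length prompt (ceiling division)."""
        return -(-self.context_length // self.chunked_prefill_size)


settings = HybridSettings()
print(f"GPU expert share: {settings.gpu_expert_fraction():.1%}")
print(f"CPU experts per layer: {settings.cpu_experts_per_layer()}")
print(f"Prefill chunks at full context: {settings.prefill_chunks_for_full_context()}")
```

So a little over half of each layer's experts sit on the GPUs, and a prompt that fills the whole context window would be processed in 64 prefill chunks under these assumptions.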
Below are the AlienKevin/SWE-ZERO-12M-trajectories benchmark results for 100 prompts at a concurrency of 10, with ~8k input tokens & ~1k output tokens each. Both Radix & Chunked Prefix Cache were disabled to simulate the absolute worst-case scenario:
- Prefill Mean Batch Tokens: 35756.93 tok/sec
- Prefill Median Batch Tokens: 652.90 tok/sec
- TTFT Mean: 20.698s
- TTFT Median: 12.714s
- Decode Mean Batch Output Tokens: 27.39 tok/sec
- Decode Median Batch Output Tokens: 20.63 tok/sec
- Utilized CPU memory: ~200 GB
A more detailed write-up will follow, covering the methodology for calculating the number of experts per MoE layer on GPU, the maximum number of tokens, and the GPU memory utilization that strike a healthy balance for running tool calls & benchmarks in this hybrid setup.
Hopefully this will be reproducible for you, on alternative GPUs, and with current & future models. Let me know how it works for you! My future plans involve GPU/CPU memory inference tests for MiniMax M3, GLM-5.2, and Kimi K2.7-Code.
Links to all of the resources for getting DeepSeek V4 Flash running at native mixed precision on 2x RTX Pro 6000 GPUs + 256 GB RAM can be found in the follow-up post.
Drug tests are funny.
The guy doing fentanyl all weekend passes by Wednesday.
The guy who hit a joint at a barbecue three weeks ago is the one failing.
Finally finished building my AI datacenter! 🚀
32x3090s across 4 servers (8 GPUs each), all connected over InfiniBand.
The whole setup is solar-powered with a massive battery bank and generator backup.
More technical details and benchmarks coming soon.
🇵🇹 Portugal FIFA World Cup 🏆
1930 – Did not enter
1934 – Did not qualify
1938 – Did not qualify
1950 – Did not qualify
1954 – Did not qualify
1958 – Did not qualify
1962 – Did not qualify
1966 – Third place 🥉
1970 – Did not qualify
1974 – Did not qualify
1978 – Did not qualify
1982 – Did not qualify
1986 – Group stage
1990 – Did not qualify
1994 – Did not qualify
1998 – Did not qualify
2002 – Group stage
2006 – Fourth place
2010 – Round of 16
2014 – Group stage
2018 – Round of 16
2022 – Quarter-finals
2026 – Qualified ✅
🇵🇹 Portugal FIFA World Cup 🏆
🇺🇾 1930 - Did not enter
🇮🇹 1934 - Did not qualify
🇫🇷 1938 - Did not qualify
🇧🇷 1950 - Did not qualify
🇨🇭 1954 - Did not qualify
🇸🇪 1958 - Did not qualify
🇨🇱 1962 - Did not qualify
🏴 1966 - Third place 🥉
🇲🇽 1970 - Did not qualify
🇩🇪 1974 - Did not qualify
🇦🇷 1978 - Did not qualify
🇪🇸 1982 - Did not qualify
🇲🇽 1986 - Group stage
🇮🇹 1990 - Did not qualify
🇺🇸 1994 - Did not qualify
🇫🇷 1998 - Did not qualify
🇰🇷🇯🇵 2002 - Group stage
🇩🇪 2006 - Fourth place
🇿🇦 2010 - Round of 16
🇧🇷 2014 - Group stage
🇷🇺 2018 - Round of 16
🇶🇦 2022 - Quarter-finals
🇨🇦🇲🇽🇺🇸 2026 - In progress 🔄
Introducing Sakana Fugu: A full multi-agent orchestration system accessible via a single model API.
Our ‘Fugu Ultra’ model matches the performance of Fable and Mythos, delivering frontier capability without the risk of export controls.
Try it: https://t.co/hhO6qTawgb 🐡
Let me explain why an AI art company just built a full-body medical scanner, because almost everyone is reading this as a random pivot.
Ultrasonic CT works by firing sound through your body and recording the ripples that scatter back. Half a million emitters the size of a grain of sand, surrounding you in water, each one listening. What comes back is noise. Reconstructing a clean 3D image of muscle and tissue from that scattered acoustic mess is an inverse problem, and it is brutally hard. The hardware is the easy part. Butterfly Network already makes the chips. The reconstruction is where every previous attempt stalled.
That reconstruction is the exact problem Midjourney spent years getting good at. Turning ambiguous input into a coherent image is what they do. They aimed it at sound waves instead of text prompts.
This is why the scan takes 60 seconds while a full-body MRI takes 60 to 90 minutes. That is 60 to 90x faster, with no radiation, no magnets, and resolution down to a fraction of a millimeter.
Then read the part most people skipped. The scans happen at a spa. Hot tubs, cold plunges, and a machine that quietly images your whole body while you relax. The scan is a side effect. You barely notice it.
Run it forward. The plan is 50,000 machines doing a billion scans every month. Midjourney has no investors and no quarterly hardware margin to chase. The payoff was never the scan fee.
A billion monthly full-body scans is the largest longitudinal map of human anatomy ever assembled. Every model trained on it gets sharper, and every sharper model makes the next scan worth more. This was always an image company. They just found a kind of image nobody else could generate.
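The figures quoted in this thread can be sanity-checked with some quick arithmetic. This is a rough sketch, not anything from the company: the 30-day month is an assumption, and the calculation ignores any turnaround time between scans.

```python
# Figures quoted in the thread.
SCAN_SECONDS = 60
MRI_MINUTES = (60, 90)
MACHINES = 50_000
SCANS_PER_MONTH = 1_000_000_000

# Assumption for illustration only: a 30-day month.
DAYS_PER_MONTH = 30

# Speedup of a 60-second scan over a 60 to 90 minute MRI.
speedup = tuple(minutes * 60 / SCAN_SECONDS for minutes in MRI_MINUTES)

# Load each machine would need to carry to reach a billion scans a month.
scans_per_machine_month = SCANS_PER_MONTH / MACHINES
scans_per_machine_day = scans_per_machine_month / DAYS_PER_MONTH
scan_hours_per_day = scans_per_machine_day * SCAN_SECONDS / 3600

print(f"Speedup vs MRI: {speedup[0]:.0f}x to {speedup[1]:.0f}x")
print(f"Scans per machine per month: {scans_per_machine_month:,.0f}")
print(f"Scans per machine per day: {scans_per_machine_day:,.0f}")
print(f"Pure scanning time per machine per day: {scan_hours_per_day:.1f} h")
```

Under these assumptions each machine would have to run roughly 667 scans a day, which is about 11 hours of back-to-back scanning.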
GitHub just ☠️ vibe coding
It just released spec-kit, and within days it has 95k stars and 8.3k forks
This is no ordinary project. It's GitHub telling you how programming with AI is really done.
The problem with AI agents isn't the model
It's that you hand it an idea as text and it interprets it however it wants
Spec-kit solves that with 6 commands that turn your idea into a structured specification before a single line of code is written
✅ /speckit.constitution → the project's rules: quality, testing, architecture
✅ /speckit.specify → you describe WHAT to build, not the stack
✅ /speckit.clarify → the agent asks about what it doesn't understand before starting
✅ /speckit.plan → now you choose the technology
✅ /speckit.tasks → task list ordered by dependencies
✅ /speckit.implement → the agent builds
The deliverable is no longer haphazardly generated code
It's a living specification that your AI reads, validates, and executes step by step
Works with Claude Code, Cursor, Copilot, Codex, Gemini CLI, and more than 25 agents
The real difference is this
Before: "make me a to-do app" and you pray the agent doesn't get lost halfway through
Now: specification first, code second
The agent knows exactly what to build, in what order, and why
95k stars. 8.3k forks. Published by GitHub itself. MIT license.
the repo is here ⬇️
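To make "task list ordered by dependencies" concrete, here is a minimal sketch of dependency ordering using Python's standard-library graphlib. It is not how spec-kit works internally, and the task names are hypothetical; it only illustrates the general idea of a topological sort.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks for a to-do app: each task maps to the set of
# tasks that must be finished before it can start.
tasks = {
    "define data model": set(),
    "build API endpoints": {"define data model"},
    "write API tests": {"build API endpoints"},
    "build UI": {"build API endpoints"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(tasks).static_order())

for step, task in enumerate(order, start=1):
    print(f"{step}. {task}")
```

The two tasks that depend only on the API endpoints may come out in either order, since neither depends on the other.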
Unreal Engine 5.8 ships today with experimental MCP server support:
Your sources, your pipeline and your workflow—simply configure the MCP plugin and connect to any agent. Get familiar with the MCP server and the PCG Primitive Plugin today and see what teams can build together: https://t.co/cDITLWWv2F