Javed Rahman @JavedRamen - Twitter Profile

Pinned Tweet

Javed Rahman

@JavedRamen

about 2 years ago · Del Mar

❤️🌊☀️

1

26

1

6K

Javed Rahman

@JavedRamen

about 13 hours ago

I learned more from this post than 5 years of scrolling on insta. Greatest platform itw

Yannick Nick

@keennay

about 16 hours ago

- DeepSeek V4 Flash - Native Precision (FP4 + FP8)
- Fits on 2x RTX Pro 6000 GPUs + 256 GB DDR5 RAM
- Using KTransformers: KVCache-AI fork of SGLang for GPU/CPU memory inference

I am somewhat obsessed with running applications on resource-constrained systems and squeezing out the maximum possible performance. Part of that comes from a past life working as a systems engineer, building & upgrading nationwide (USA) Video-On-Demand streaming backends, while navigating headless *nix servers around the time "cloud" was becoming a buzzword.

KTransformers gets less mention across the LLM inference-sphere despite being among the engines listed for many of the popular models on HuggingFace (alongside vLLM, SGLang, & llama.cpp). The KVCache-AI team is best known for providing a forked SGLang for hybrid GPU / CPU memory inference, benefitting MoE models. I expect these hybrid setups to gain in popularity, especially on the consumer side as hardware prices continue soaring.

"Necessity is the mother of invention" as they say, and local AI runners will continue finding more creative ways to run intelligence, whether that involves GPU/CPU memory offload, distributed training / inference, model weight / KV Cache quants, or REAPs.

Here I have DeepSeek V4 Flash running at a 1M context length on 2x RTX Pro 6000 GPUs, using its native mixed precision of FP4 + FP8. KTransformers lets you reduce GPU memory usage by keeping only some of the experts in each MoE layer in GPU VRAM, with the remaining experts held in system RAM. KTransformers can also update GPU expert placement during inference, using routing statistics collected during the prefill phase. There's also a lot of trial and error involved given the limited kernel support for the RTX Pro 6000.
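To make the idea of statistics-driven expert placement concrete, here is a minimal sketch. It is not KTransformers' actual algorithm or API: the function name `pick_gpu_experts`, the data layout, and the sample routing counts are all hypothetical, and it simply assumes that the most frequently routed experts in each layer are the ones worth keeping in VRAM.

```python
from typing import Dict, List


def pick_gpu_experts(
    routing_counts: Dict[int, List[int]], experts_on_gpu: int
) -> Dict[int, List[int]]:
    """For each MoE layer, return the ids of the most frequently routed experts.

    routing_counts maps a layer index to a list where position e holds how
    many tokens were routed to expert e (hypothetical prefill statistics).
    """
    placement = {}
    for layer, counts in routing_counts.items():
        # Rank expert ids by hit count (descending); ties go to the lower id.
        ranked = sorted(range(len(counts)), key=lambda e: (-counts[e], e))
        # The hottest experts stay on GPU; everything else would live in RAM.
        placement[layer] = sorted(ranked[:experts_on_gpu])
    return placement


# Hypothetical statistics: 2 layers with 8 experts each, 4 GPU slots per layer.
stats = {
    0: [5, 40, 3, 22, 0, 17, 9, 1],
    1: [12, 12, 30, 0, 8, 2, 25, 4],
}
placement = pick_gpu_experts(stats, experts_on_gpu=4)
print(placement)
```

In the setup described here the same idea would apply at a larger scale, with 134 of 256 experts per layer kept on GPU.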

Two of the prompt-load stress-test benchmarks I like to run come from the local-inference-lab/llm-inference-bench GitHub repo & the AlienKevin/SWE-ZERO-12M-trajectories HuggingFace dataset.

Here are the main KTransformers SGLang optimized flags:

- Context Length: 1048576
- Total Number of Tokens: 1048576
- Chunked Prefill Size: 16384
- Max Prefill Tokens: 16384
- GPU Prefill Token Threshold: 1024
- GPU Memory Utilization: 87%
- Number of Experts per MoE Layer on GPU: 134 / 256
- Max Running Requests: 256
- CUDA Graph Max Batch Size: 256
- CUDA Graph Batch Sizes: 1 2 4 8 16 32 64 128 256
- Available GPU Memory: 20.81 GB (anything less was too tight for agentic coding)
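A few quantities follow directly from the settings listed above. The snippet below is only back-of-the-envelope arithmetic on those numbers; the variable names are illustrative and are not real KTransformers or SGLang flag names.

```python
# Values copied from the settings list; names are illustrative only.
context_length = 1_048_576
chunked_prefill_size = 16_384
experts_on_gpu, experts_per_layer = 134, 256
cuda_graph_batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]

# Number of prefill chunks needed to ingest a full-length context
# (ceiling division, in case the context is not an exact multiple).
prefill_chunks = -(-context_length // chunked_prefill_size)

# Share of each MoE layer's experts resident in VRAM vs. system RAM.
gpu_expert_fraction = experts_on_gpu / experts_per_layer
cpu_experts = experts_per_layer - experts_on_gpu

# The CUDA graph batch sizes are the powers of two up to the max batch size.
all_powers_of_two = all(b & (b - 1) == 0 for b in cuda_graph_batch_sizes)

print(f"prefill chunks for a full context: {prefill_chunks}")
print(f"experts on GPU per layer: {gpu_expert_fraction:.1%}")
print(f"experts in system RAM per layer: {cpu_experts}")
print(f"batch sizes are powers of two: {all_powers_of_two}")
```

So a full 1M-token prompt would be processed in 64 prefill chunks, and just over half of the experts in each layer sit in VRAM.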

Below are the AlienKevin/SWE-ZERO-12M-trajectories benchmark results for 100 prompts at a concurrency of 10, with ~8k input tokens & ~1k output tokens per prompt. Both Radix & Chunked Prefix Cache were disabled to get the absolute worst-case scenario:

- Prefill Mean Batch Tokens: 35756.93 tok/sec
- Prefill Median Batch Tokens: 652.90 tok/sec
- TTFT Mean: 20.698s
- TTFT Median: 12.714s
- Decode Mean Batch Output Tokens: 27.39 tok/sec
- Decode Median Batch Output Tokens: 20.63 tok/sec
- Utilized CPU memory: ~200 GB
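The large gap between the mean and median prefill figures is what a skewed distribution tends to produce: a handful of very fast batches can pull the mean far above the typical value. The sketch below illustrates that effect with made-up samples. These are not the measurements behind the results above, and `summarize` is a hypothetical helper.

```python
from statistics import mean, median


def summarize(samples):
    """Return the mean and median of a list of per-batch measurements."""
    return {"mean": mean(samples), "median": median(samples)}


# Hypothetical prefill throughput samples in tok/sec (NOT the author's data):
# four ordinary batches and one very fast outlier.
prefill_tok_per_sec = [600, 650, 700, 640, 180000]
summary = summarize(prefill_tok_per_sec)
print(summary)
```

With these samples the median stays at 650 tok/sec while the mean jumps above 36,000, so for skewed benchmark data the median is usually the better guide to typical behavior.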

A more detailed write-up will follow, which'll include the methodology of calculating the number of experts per MoE layer on GPU, maximum number of tokens, and GPU memory utilization for a healthy balance for running tool calls & benchmarks in this hybrid setup.

Hopefully this'll be reproducible for you and on alternative GPUs, as well as current & future models. Let me know how it works for you! My future plans involve GPU/CPU memory inference tests for MiniMax M3, GLM-5.2, and Kimi K2.7-Code.

Links to all of the resources for getting DeepSeek V4 Flash running at native mixed precision on 2x RTX Pro 6000 GPUs + 256 GB RAM can be found in the follow-up post.

35

712

65

910

300K

0

1

0

11

Javed Rahman

@JavedRamen

2 days ago

@Th0rgal_ @ouraring ??

0

1

0

12

JavedRamen retweeted

𝓡𝓱𝓶𝔃

@TRimamfyen

3 days ago

Drug tests are funny. The guy doing fentanyl all weekend passes by Wednesday. The guy who hit a joint at a barbecue three weeks ago is the one failing.

428

154K

7K

3K

3M


Javed Rahman

@JavedRamen

3 days ago

0

2

0

4

Javed Rahman

@JavedRamen

3 days ago

@WorldCupAnime Wow

0

10

JavedRamen retweeted

Max Zanoga

@zanoga

3 days ago

Finally finished building my AI datacenter! 🚀

32x3090s across 4 servers (8 GPUs each), all connected over InfiniBand.

The whole setup is solar-powered with a massive battery bank and generator backup.

More technical details and benchmarks coming soon. https://t.co/8GfedrSzNp

588

6K

407

2K

794K

JavedRamen retweeted

World of Statistics

@stats_feed

3 days ago

🇵🇹 Portugal FIFA World Cup 🏆

🇺🇾 1930 - Did not enter
🇮🇹 1934 - Did not qualify
🇫🇷 1938 - Did not qualify
🇧🇷 1950 - Did not qualify
🇨🇭 1954 - Did not qualify
🇸🇪 1958 - Did not qualify
🇨🇱 1962 - Did not qualify
🏴󠁧󠁢󠁥󠁮󠁧󠁿 1966 - Third place 🥉
🇲🇽 1970 - Did not qualify
🇩🇪 1974 - Did not qualify
🇦🇷 1978 - Did not qualify
🇪🇸 1982 - Did not qualify
🇲🇽 1986 - Group stage
🇮🇹 1990 - Did not qualify
🇺🇸 1994 - Did not qualify
🇫🇷 1998 - Did not qualify
🇰🇷🇯🇵 2002 - Group stage
🇩🇪 2006 - Fourth place
🇿🇦 2010 - Round of 16
🇧🇷 2014 - Group stage
🇷🇺 2018 - Round of 16
🇶🇦 2022 - Quarter-finals
🇨🇦🇲🇽🇺🇸 2026 - In progress 🔄

8

193

14

21

71K

JavedRamen retweeted

Cristiano Ronaldo

@Cristiano

3 days ago

United! 🇵🇹

13K

989K

63K

10K

20M

JavedRamen retweeted

FLBorn🕷️

@Master_IP10

4 days ago

gonna take a break from women to focus on substance abuse

154

23K

3K

1K

547K

JavedRamen retweeted

World Cup Anime

@WorldCupAnime

3 days ago

Ronaldo Is Back.

615

78K

9K

15K

4M

Javed Rahman

@JavedRamen

3 days ago

0

Javed Rahman

@JavedRamen

4 days ago

Entirely true

Prompter

@PromptLLM

4 days ago

Crazy take from Claude

557

59K

8K

16K

3M

0

10

JavedRamen retweeted

Cristiano Ronaldo

@Cristiano

4 days ago

WE ARE HERE!

33K

2M

159K

19K

42M

JavedRamen retweeted

Sakana AI

@SakanaAILabs

5 days ago

Introducing Sakana Fugu: A full multi-agent orchestration system accessible via a single model API. Our ‘Fugu Ultra’ model matches the performance of Fable and Mythos, delivering frontier capability without the risk of export controls. Try it: https://t.co/hhO6qTawgb 🐡

1K

38K

6K

30K

26M

JavedRamen retweeted

Elon Musk

@elonmusk

5 days ago

“Full access” 😂

7K

364K

18K

17K

58M

Javed Rahman

@JavedRamen

7 days ago

😂

0

3

JavedRamen retweeted

Aakash Gupta

@aakashgupta

9 days ago

Let me explain why an AI art company just built a full-body medical scanner, because almost everyone is reading this as a random pivot.

Ultrasonic CT works by firing sound through your body and recording the ripples that scatter back. Half a million emitters the size of a grain of sand, surrounding you in water, each one listening. What comes back is noise. Reconstructing a clean 3D image of muscle and tissue from that scattered acoustic mess is an inverse problem, and it is brutally hard.

The hardware is the easy part. Butterfly Network already makes the chips. The reconstruction is where every previous attempt stalled.

That reconstruction is the exact problem Midjourney spent years getting good at. Turning ambiguous input into a coherent image is what they do. They aimed it at sound waves instead of text prompts.

This is why the scan takes 60 seconds while a full-body MRI takes 60 to 90 minutes. Close to 100x faster, no radiation, no magnets, resolution down to a fraction of a millimeter.

Then read the part most people skipped. The scans happen at a spa. Hot tubs, cold plunges, and a machine that quietly images your whole body while you relax. The scan is a side effect. You barely notice it.

Run it forward. The plan is 50,000 machines doing a billion scans every month. Midjourney has no investors and no quarterly hardware margin to chase. The payoff was never the scan fee.

A billion monthly full-body scans is the largest longitudinal map of human anatomy ever assembled. Every model trained on it gets sharper, and every sharper model makes the next scan worth more.

This was always an image company. They just found a kind of image nobody else could generate.

111

4K

372

2K

481K

JavedRamen retweeted

Midjourney

@midjourney

9 days ago

A technical dive inside our new "Midjourney Scanner"

1K

28K

3K

12K

11M

Javed Rahman

@JavedRamen

10 days ago

🐐 platform

arc.

@arceyul

11 days ago

GitHub just ☠️ vibe coding

It just published spec-kit, and within a few days it has 95k stars and 8.3k forks

This is not just any project. It is GitHub telling you how to really program with AI.

The problem with AI agents is not the model. It is that you send it an idea as text and it interprets it however it wants.

Spec-kit solves that with 6 commands that turn your idea into a structured specification before a single line of code is written

✅ /speckit.constitution → the project rules: quality, testing, architecture
✅ /speckit.specify → you describe WHAT to build, not the stack
✅ /speckit.clarify → the agent asks about what it does not understand before starting
✅ /speckit.plan → now you choose the technology
✅ /speckit.tasks → task list ordered by dependencies
✅ /speckit.implement → the agent builds

The deliverable is no longer haphazardly generated code. It is a living specification that your AI reads, validates, and executes step by step.

Works with Claude Code, Cursor, Copilot, Codex, Gemini CLI, and more than 25 agents

The real difference is this

Before: "build me a to-do app" and you pray the agent does not get lost halfway
Now: specification first, code afterwards

The agent knows exactly what to build, in what order, and why

95k stars. 8.3k forks. Published by GitHub itself. MIT license.

The repo is here ⬇️

56

3K

441

6K

251K

0

18

JavedRamen retweeted

Unreal Engine

@UnrealEngine

10 days ago

Unreal Engine 5.8 ships today with experimental MCP server support: Your sources, your pipeline and your workflow—simply configure the MCP plugin and connect to any agent. Get familiar with the MCP server and the PCG Primitive Plugin today and see what teams can build together: https://t.co/cDITLWWv2F

233

7K

900

5K

3M
