Approaching.AI

@ApproachingAI

Approaching.AI is a high-performance AI infrastructure company focused on large model inference optimization.

Joined April 2026

11 Following

6 Followers

13 Posts

Approaching.AI

@ApproachingAI

27 days ago

📍 vLLM Community Meetup, May 23

https://t.co/lSgpkevhhO joined the vLLM @vllm_project community meetup themed "How will inference acceleration frameworks reshape LLM deployment in the Agentic AI era?", alongside Red Hat, Inferact, Alibaba, NVIDIA, Moonshot AI, and the open-source community.

Lu Jiahao (https://t.co/lSgpkevhhO tech expert, Mooncake core contributor) co-presented "llm-d + Mooncake: Large-Scale Agentic AI Inference in Practice" with Greg Pereira, core maintainer of llm-d.

━━━━━━━━━━

🔹 The Agentic AI bottleneck
As context lengths grow, traditional compute paradigms face new cost and performance pressure. But multi-turn interactions, tool calls and long-context reuse give Agentic workloads very high KV cache hit rates, which opens a window for "storage-for-compute" trade-offs.

🔹 Mooncake: store more, transfer fast
Transfer Engine (TENT): evolves from a passive communication library into an active orchestrator. Declarative orchestration replaces imperative APIs, delivering low-latency, high-throughput transfer across heterogeneous interconnects.
Mooncake Store: a high-capacity, low-cost SSD tier joins the cache hierarchy. TCO drops, performance holds, and the KV cache pool scales beyond single-machine limits for cross-instance, cross-node context reuse.

🔹 llm-d × Mooncake roadmap
Phase 1, MooncakeConnector: PD disaggregation, P2P transfer, Messages API
Phase 2, MooncakeStoreConnector: tiered cache offload, Responses API

━━━━━━━━━━

https://t.co/lSgpkevhhO will keep contributing to Mooncake and partner with the community to push inference systems toward a new balance of scale, performance and cost for the Agentic AI era.

Approaching.AI

@ApproachingAI

about 1 month ago

Handy for capacity planning and config sizing — give it a try 👍

KVCache.AI

@KVCache_AI

about 1 month ago

🚀 We just launched the open-source KV Cache Size Calculator by https://t.co/GavTTEDu5C! Calculate KV cache size for mainstream LLMs with flexible precision settings and detailed breakdowns. Supports DeepSeek, GLM, Kimi, Qwen3 and MiniMax. Try it now: https://t.co/PpN0bvKSVI
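
For readers who want a feel for what such a calculator computes, here is a minimal sketch of the standard per-token KV cache arithmetic for multi-head or grouped-query attention. It is not the calculator's own code. The function name and the example configuration are hypothetical, chosen only for illustration. Models that use multi-head latent attention (such as DeepSeek's) cache a compressed latent instead, so this formula does not apply to them.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   bytes_per_element, seq_len, batch_size=1):
    """Estimate KV cache size for standard MHA/GQA attention.

    Each token stores one key vector and one value vector (hence the
    factor of 2) per KV head, per layer.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * batch_size


# Hypothetical configuration, not taken from any specific model card:
# 32 layers, 8 KV heads, head_dim 128, 16-bit cache, 4096-token context.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      bytes_per_element=2, seq_len=4096)
print(f"{size / 2**20:.0f} MiB")  # 512 MiB
```

Halving `bytes_per_element` (for example an 8-bit cache) halves the result, which is the kind of precision trade-off the post refers to.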



Approaching.AI

@ApproachingAI

about 1 month ago

At AMD Developer Day @AIatAMD 🚀
Tsinghua University Assoc. Prof. and https://t.co/lSgpkevhhO co-initiator Mingxing Zhang @james0zan gave a talk:
“How Do We Reduce Token Costs in the Agent Era?”
---
The core shift in AI workloads:

We're moving from simple chatbots to complex agent applications.
Each agent task requires dozens of model calls and heavy context I/O, pushing token consumption from 1x to 1000x.
Not all tokens share the same SLO, and compute infrastructure does not equal efficient token production.
Cache, transfer, and token reuse matter as much as the silicon underneath.
---
This thinking shapes Mooncake
https://t.co/jfLpfkN40F

The open-source KV-cache-centric serving substrate for modern LLMs.
Originally focused on KV cache (store more, transfer fast, easy to use), Mooncake has grown into a common substrate for disaggregated LLM serving:
✅ PD / EPD / RL disaggregation
✅ Large-scale elastic EP
✅ Efficient model weight loading
---
Latest Mooncake highlights from the talk:

⚡ Checkpoint Engine: 1T parameters synced in seconds for distributed RL
⚡ Elastic EP in SGLang: partial failure tolerance for DeepSeek MoE deployments
⚡ vLLM native integration: scaling agentic workloads in production
⚡ Omni Models
⚡ TorchSpec: decoupled speculative decoding framework, streaming hidden states from inference to training
⚡ RL toolchains: TransferQueue / Slime (WIP) / Roll (WIP) for cross-device async non-blocking scheduling
⚡ Production-proven: large-scale deployment validated by RBG + SGLang + Mooncake
---
Thanks @AIatAMD for hosting.
We look forward to deepening collaboration with the AMD developer community on the infrastructure layer of the agent economy.



Approaching.AI

@ApproachingAI

about 1 month ago

The goal: let developers work with frontier LLMs on the hardware they already have. Local inference → local fine-tuning. Model adaptation → multi-platform. KTransformers is building more open local LLM infrastructure.

Approaching.AI

@ApproachingAI

about 1 month ago

At GOSIM Paris 2026 @gosimfoundation, https://t.co/TNZHDOONnW's Chief Engine Architect Ervin Xie shared what's next for KTransformers: local fine-tuning via KT-SFT (run + tune on one workstation), and broader platform support, including Windows.

Approaching.AI

@ApproachingAI

about 1 month ago

5/ Deploy in 3 steps: download weights from HuggingFace → launch SGLang server with KT-Kernel → interact via KT CLI or API. Full guide: doc/en/DeepSeek-V4-Flash.md

Approaching.AI

@ApproachingAI

about 1 month ago

1/ KTransformers now supports DeepSeek-V4 on consumer hardware. Single RTX 5090 + RAM → V4-Flash at 20 tok/s. Two RTX 5090s → V4-Pro. First time a model at this scale runs end-to-end on consumer GPUs.

Approaching.AI

@ApproachingAI

about 1 month ago

4/ New operators: MX_FP4_MOE_TP on CPU (native E2M1 quant + ue8m0 scaling, group_size=32). SM_120 portable MXFP4 MoE kernel on GPU for Blackwell. Full path from quantization format to hardware execution.
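
To make the format in this post concrete, here is a rough pure-Python sketch of MXFP4-style block quantization: 32 values share one power-of-two scale (the ue8m0 exponent), and each value is rounded to the nearest E2M1 magnitude. It follows the general scheme of the OCP Microscaling format as I understand it and is not the KTransformers kernel. The function names are hypothetical, and tie-breaking is simplified compared with hardware rounding.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # largest magnitude is 6.0 = 1.5 * 2**2
GROUP_SIZE = 32    # elements sharing one scale


def quantize_group(values):
    """Return (shared_exp, quantized) for one group of 32 floats."""
    assert len(values) == GROUP_SIZE
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0.0] * GROUP_SIZE
    # Power-of-two scale chosen so the largest value lands near E2M1's top.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))  # fits an 8-bit biased exponent
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_VALUES[-1])  # clamp to max magnitude
        # Simplified rounding: nearest value, ties go to the smaller one.
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - mag))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_group(shared_exp, quantized):
    """Reconstruct approximate floats from the scale and E2M1 values."""
    return [q * 2.0 ** shared_exp for q in quantized]


group = [12.0, -3.0] + [0.0] * 30
exp, q = quantize_group(group)
print(exp, q[:2], dequantize_group(exp, q)[:2])
```

A real kernel would pack two 4-bit codes per byte and store the exponent as a biased byte (`shared_exp + 127`); this sketch keeps plain floats to show only the arithmetic.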

Approaching.AI

@ApproachingAI

2 months ago

Proud to be a launch partner for Kimi K2.6! 🎉 We shipped full support on Day 0:
✅ Token service live
✅ KTransformers adapted: full-precision inference & fine-tuning with just 48GB VRAM + sufficient RAM
Try it now 👉 https://t.co/D6YauRMrIW

Kimi.ai @Kimi_Moonshot

2 months ago

Meet Kimi K2.6: Advancing Open-Source Coding

🔹Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), Charxiv w/ python (86.7), Math Vision w/ python (93.2)

What's new:
🔹Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization).
🔹Motion-rich frontend - Videos in hero sections, WebGL shaders, GSAP + Framer Motion, Three.js 3D.
🔹Agent Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from K2.5's 100 / 1,500). One prompt, 100+ files.
🔹Proactive Agents - K2.6 model powers OpenClaw, Hermes Agent, etc for 24/7 autonomous ops.
🔹Claw Groups (research preview) - bring your own agents, command your friends', bots & humans in the loop.
-
K2.6 is now live on https://t.co/YutVbwktG0 in chat mode and agent mode.
For production-grade coding, pair K2.6 with Kimi Code: https://t.co/uvoSJKyGCY
-
🔗 API: https://t.co/EOZkbOwCN4
🔗 Tech blog: https://t.co/9wWvgIQSS3
🔗 Weights & code: https://t.co/Be0hjs2RTP



Approaching.AI

@ApproachingAI

2 months ago

Mooncake × SGLang introduce Elastic Expert Parallelism (Elastic EP) for large-scale MoE inference. With partial failure tolerance, Elastic EP enables recovery from GPU/node failures in seconds — without interrupting serving. https://t.co/o37uXG2UQ8

Approaching.AI

@ApproachingAI

2 months ago

Big update from https://t.co/1v5Wy3AVgv: Dr. Jenny Wu joins as President, bringing experience from Baidu Capital, https://t.co/1l61Wllrja, Morgan Stanley, and New Hope. She’ll lead strategy, finance, and global ops as we scale efficient AI inference. More to come. #Inference
