Compass is a free blueprint of a production-grade customer support agent built to demonstrate how modern agent systems are actually engineered and operated in real environments.
You can explore the live demo at https://t.co/W1kc0jAFfx and read the walk-through here (https://t.co/Z6esMGK0ha).
You can also get the full repository (https://t.co/sbpeBK9z8F).
It includes infrastructure and service boundaries across LangGraph + Deep Agents (@LangChain), @togethercompute, Kong, Caddy, @PostgreSQL, @Redisinc, @qdrant_engine, @redpandadata, @Minio, @PrometheusIO, Loki, @grafana, and Vault, along with the APIs and UI surfaces needed to support both end users and operators.
The goal is to reason about the entire application topology end-to-end: request flow, state management, review workflows, data dependencies, observability, and operational control.
If you or your team is currently evaluating concepts such as deep agents, agentic workflows, multi-agent orchestration, RAG pipelines, guardrails, observability, prompt/version management, or human-in-the-loop controls, Compass gives you something much more useful than a conceptual diagram: it gives you a concrete system you can inspect, run, and study end-to-end.
Personally, I think there is finally a cheap model with vision, long context, and coding-agent capability (minimax-m3), but I also can’t help asking: “Is this benchmark-optimized but weaker in real use? Will it run locally, or is it too large for local hardware?”
M5 MacBook Pro or NVIDIA DGX Spark or RTX PRO 6000?
Before answering this in depth, please think about the following question: "What part of the agent loop is your hardware supposed to accelerate?"
A chat UI and an agentic system stress hardware differently.
An agentic coding system cares about repeated long-context reads, tool calls, cache churn, parallel background workers, retrieval, subprocesses, test runs, container work, and sometimes many independent agents trying to use the same model server at once.
Rather than comparing “Mac versus NVIDIA” or “unified memory versus VRAM”, you should focus on the workload decomposition problem.
Before we start the hardware discussion: please do not buy hardware based on a single headline tokens-per-second number.
Map your agent workflow to four bottlenecks:
- Model fit
- Prefill latency
- Decode throughput
- Concurrent serving behavior
Once you do that, the debate gets much less emotional and much more useful.
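As a rough illustration, the four bottlenecks can be turned into back-of-the-envelope arithmetic. This is a minimal sketch, not a benchmark: the `Hardware` and `Workload` names and every number in the usage note are hypothetical placeholders, and the even split of decode throughput across concurrent agents is a deliberately pessimistic assumption (real servers batch requests). Plug in speeds you measured yourself.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    memory_gb: float       # memory usable for weights + KV cache
    prefill_tok_s: float   # prompt-processing speed you measured
    decode_tok_s: float    # single-stream generation speed you measured


@dataclass
class Workload:
    params_b: float         # model parameters, in billions
    bits_per_weight: float  # average bits per weight after quantization
    kv_cache_gb: float      # KV cache budget at your target context length
    prompt_tokens: int      # tokens read per agent step
    output_tokens: int      # tokens generated per agent step
    concurrent_agents: int  # agents sharing one model server


def weights_gb(w: Workload) -> float:
    # Billions of parameters * bits per weight / 8 = gigabytes of weights.
    return w.params_b * w.bits_per_weight / 8


def fits(hw: Hardware, w: Workload) -> bool:
    # Bottleneck 1: model fit (weights plus KV cache must fit in memory).
    return weights_gb(w) + w.kv_cache_gb <= hw.memory_gb


def step_seconds(hw: Hardware, w: Workload) -> float:
    # Bottleneck 2: prefill latency for reading the prompt.
    prefill = w.prompt_tokens / hw.prefill_tok_s
    # Bottlenecks 3 and 4: decode throughput under concurrency.
    # Assumption: agents split decode throughput evenly (pessimistic).
    decode = w.output_tokens / (hw.decode_tok_s / w.concurrent_agents)
    return prefill + decode
```

For example, with placeholder values `Hardware(24, 2000, 50)` and `Workload(35, 4, 4, 8000, 500, 2)`, the weights come to 17.5 GB, the model fits, and one agent step takes about 24 seconds, most of it decode. Changing one input at a time shows which bottleneck your workflow actually hits.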
Read the deep dive here: https://t.co/Xva4GFWSPb
Most developers are wondering:
How can I run a 70B model locally?
Can I run a 120B MoE?
Can I finally stop paying API bills?
People with 2x RTX 3090s, RTX 4090s, RTX 6000-class cards, Strix Halo systems, big-memory Macs, and mixed GPU rigs are able to run completely different classes of models.
https://t.co/fWVwywLdET
You are NOT ready for what's coming this week 🔥
- Qwen3.7 max
- Qwen3.7 27b/35b
- Minimax M3.0
- Gemini 3.5 Pro/Flash
- GPT-5.6
- Sonnet-4.8
- Kimi/GLM?
Which one are you evaluating first?
@alexisgallagher @iamtrask thanks a lot @alexisgallagher. Model capabilities improve with each new release, but deploying and operating local LLMs still requires many trade-offs and attention to edge cases
great stats regardless @ashen_one. Given that you can run gemma-4-26gb or Qwen3.6-35B-A3B on M5 MacBook Pro variants or a 4090, curious to see the performance diff vs GLM 5.1, what do you think? We covered some of the latest open-source models (qwen, gemma) that you can run on prosumer hardware here https://t.co/Pyp0kB4Rhl
@breath_mirror @bstnxbt great performance, will definitely have a look at the implementation. Normally, if you have an RTX 4090 or M5 Pro/Max, you can run Qwen3.6-35B-A3B at 50-60 tok/s. We published a deep dive here https://t.co/dSI7zQPagZ
@ivanfioravanti great stats indeed Ivan, thanks a lot for sharing. We also shared recent oMLX community benchmarks here along with qwen and gemma models that you can run locally now on your prosumer hardware https://t.co/H3zTgRvR6N
@devruso agreed, just this morning we published this piece. If you have a high-tier workstation or M5 MacBook Pro, you are now good to go with recent releases of qwen and gemma models https://t.co/H3zTgRvR6N
Which hardware (RTX 4090 or M5 Max 128GB) can run parts of the agent loop locally with good performance?
A single RTX 4090 can already handle a serious amount of coding work with the right open model, the right quantization, and a harness that doesn’t waste half your context window on boilerplate.
Apple Silicon machines like an M5 Pro 64GB or M5 Max 128GB are compelling for a different reason: not because they magically beat the cloud, but because they buy you local capacity, mobility, privacy, and larger-context workflows.
Meanwhile, the cloud still wins whenever you need top-end reasoning reliability, long-horizon planning, or consistently excellent output under pressure.
We want to make a developer-first case for how to think about consumer hardware in 2026 if you are building AI features, coding agents, or internal agentic workflows.
We will go through:
- What consumer hardware can realistically do today (including model benchmarks on Apple Silicon)
- State of open coding models in 2026
- Where local models already pull their weight and where they still break
- Why the harness matters almost as much as the model
- Why the best local setup is often a subagent or coprocessor, not a full replacement
Read the deep dive here: https://t.co/Pyp0kB4jrN
How about orchestrating a codebase on 5GB VRAM using a local Qwen3.5-35B-A3B (~25-30 tokens/sec through llama.cpp, 65k context, remaining layers offloaded to system RAM)?
Even better: two simultaneous agent instances can run comfortably at ~15-20 t/s, and the setup natively supports both thinking and non-thinking models (including Gemma 4) or can be pointed at heavy-compute cloud endpoints for complex architectural tasks.
If this sounds too good to be true, please keep reading.
Running local LLMs often feels like a downgrade from premium cloud subscriptions, but the real constraint is not just model quality, it is systems design.
Context windows are finite, and simply increasing token capacity does not eliminate the need to control what the model sees.
In practice, larger contexts frequently introduce more noise, more drift, and weaker reasoning when that context is not actively curated.
What local coding agents need is not a bigger monolithic chat loop, but a better execution architecture: a lightweight terminal environment that separates planning from implementation.
The primary orchestrator should operate like a lead architect. It should inspect the codebase, build a concrete implementation plan, decompose work into atomic tasks, and dispatch those tasks to short-lived subagents with tightly scoped, isolated contexts.
Each coding subagent should execute one bounded change, return a compact summary of the result, and terminate.
That keeps the planner’s context clean, prevents edit history from ballooning, and avoids the gradual degradation you get when every action is forced through one ever-expanding conversation.
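The planner/subagent split described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not how any particular tool implements it: `ModelFn`, `Task`, `Planner`, and `run_subagent` are hypothetical names, the model is a plain callable you supply, and the plan is passed in as a ready-made task list rather than generated.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for a model call: takes a prompt, returns text.
ModelFn = Callable[[str], str]


@dataclass
class Task:
    description: str   # one bounded change
    files: List[str]   # the only files this subagent is told about


def run_subagent(model: ModelFn, task: Task, max_summary_chars: int = 200) -> str:
    # Fresh, tightly scoped context: only this task, nothing from earlier tasks.
    context = [f"Task: {task.description}", f"Files: {', '.join(task.files)}"]
    result = model("\n".join(context))
    # The subagent's context is dropped on return; only a compact summary survives.
    return result[:max_summary_chars]


@dataclass
class Planner:
    model: ModelFn
    notes: List[str] = field(default_factory=list)  # the planner's only long-lived state

    def run(self, tasks: List[Task]) -> List[str]:
        for task in tasks:
            # Each task gets its own short-lived subagent.
            self.notes.append(run_subagent(self.model, task))
        return self.notes
```

The design point is that `Planner.notes` grows by one short summary per task, while the full working context of each subagent never reaches the planner.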
The result is a system that behaves less like a confused chatbot and more like a disciplined engineering team with clear task boundaries and fast feedback loops.
That is the idea behind Late.
Late is a deterministic coding-agent orchestrator built to make local LLMs viable for serious agentic software development.
Instead of dumping an entire repository into a single context window and hoping the model stays coherent, it maps the codebase, maintains a high-level control plane, and spawns ephemeral execution agents to perform precise, exact-match code edits.
By mirroring the structure of a real engineering organization, Late reduces token bloat, limits context pollution, and improves reliability in long-running coding workflows.
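To make "exact-match code edits" concrete, here is one common way such an edit primitive is written. It is a sketch of the general technique under the assumption that "exact-match" means the target snippet must occur exactly once in the file; `apply_exact_edit` is a hypothetical helper, not Late's actual API.

```python
def apply_exact_edit(source: str, old: str, new: str) -> str:
    """Replace `old` with `new` only if `old` occurs exactly once in `source`."""
    if not old:
        raise ValueError("the snippet to replace must not be empty")
    count = source.count(old)
    if count != 1:
        # Zero matches means the model misremembered the code; more than one
        # means the edit is ambiguous. Either way, refuse instead of guessing.
        raise ValueError(f"expected exactly one match, found {count}")
    return source.replace(old, new, 1)
```

Failing loudly on zero or multiple matches gives the agent a clear signal to re-read the file and retry, rather than silently editing the wrong location.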
https://t.co/prnlQJjnTn