Been buried in work travel for weeks, but I finally got a window to update https://t.co/7N1iyI24ys
It’s now running Gemma 4 E2B QAT. Small update, big improvement.
New blog: Building agents that reach production systems with MCP.
When should agents use direct APIs vs CLIs vs MCP? Plus patterns for building MCP servers, context-efficient clients and pairing MCP with skills.
https://t.co/Q4UrUVgVYB
I'm a keyboard person... ⌘C all day.
One day I thought: what if copying twice triggered an action on what I just copied?
Turns out it works. So I built it.
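A minimal sketch of the "copy twice" trigger, assuming the app already receives a timestamp for each copy event. The class name, the 0.5 second window, and the `on_copy` method are all hypothetical, not taken from the actual tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DoubleCopyDetector:
    """Flags a second copy that arrives shortly after the first one."""

    window_s: float = 0.5  # assumed threshold, tune to taste
    _last_copy_at: Optional[float] = None

    def on_copy(self, now: float) -> bool:
        """Return True when this copy event completes a double copy."""
        last = self._last_copy_at
        if last is not None and (now - last) <= self.window_s:
            # Reset so a third quick copy starts a new pair instead of re-firing.
            self._last_copy_at = None
            return True
        self._last_copy_at = now
        return False
```

In a real app, `now` would probably come from `time.monotonic()` inside whatever clipboard or hotkey hook the platform provides.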
@ivanfioravanti This worked for me on Apple Silicon with llama-server + Gemma 4:
https://t.co/v5qqD9uk7E
Tested on a MacBook Pro with Apple M4 Max and 36GB. Getting 75.36 t/s
Gemma 4 is the first release in a while that makes the local-first future feel closer.
I’ve been pretty convinced for a while now that local-first is where this is going.
What feels different now is that it’s starting to make sense all at once: technically, economically, and culturally.
I ❤️ local inference getting faster and cheaper on real hardware. Once developers can benchmark what actually runs on their own machines, cloud-only assumptions start to break down fast.
Ollama is now updated to run the fastest on Apple silicon, powered by MLX, Apple's machine learning framework.
This change unlocks much faster performance to accelerate demanding work on macOS:
- Personal assistants like OpenClaw
- Coding agents like Claude Code, OpenCode, or Codex
Using OpenAI Symphony to push further on harness engineering.
I wanted agents to pull work directly from GitHub Projects, so I built the GitHub tracker adapter:
👉 https://t.co/WEAW7CdQUv
i built an agent skill for powerpoint decks. both the slides and the tooling were not straightforward.
the interesting part about writing a skill for ppt slides is what i learned making tools work when the user is a language model.
llms don't read "--help". they hallucinate field names. they'll dump your entire schema into context and wonder why they can't reason anymore.
what actually works: runtime introspection over docs, structured errors over helpful messages, validation before mutation, progressive disclosure over dump-everything.
this isn't a slides thing. it's a pattern. every tool you build for agents will need this.
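a rough sketch of those four patterns in one tiny tool, assuming a made-up slide schema. the command names, field names and error codes here are illustrative only, not the real agent-slides interface.

```python
from typing import Any, Optional

# Hypothetical schema for a slide tool; every name here is an assumption.
SCHEMA: dict[str, dict[str, type]] = {
    "add_slide": {"title": str, "layout": str},
    "set_text": {"slide": int, "text": str},
}


def describe(command: Optional[str] = None) -> dict[str, Any]:
    """Runtime introspection with progressive disclosure.

    No argument lists command names only; a command name returns just
    that command's fields, so the full schema never lands in context.
    """
    if command is None:
        return {"commands": sorted(SCHEMA)}
    if command not in SCHEMA:
        return {"ok": False, "errors": [
            {"code": "unknown_command", "got": command, "allowed": sorted(SCHEMA)}]}
    fields = {name: t.__name__ for name, t in SCHEMA[command].items()}
    return {"command": command, "fields": fields}


def validate(command: str, args: dict[str, Any]) -> list[dict[str, Any]]:
    """Return structured errors an agent can act on without parsing prose."""
    if command not in SCHEMA:
        return [{"code": "unknown_command", "got": command, "allowed": sorted(SCHEMA)}]
    fields = SCHEMA[command]
    errors: list[dict[str, Any]] = []
    for name in args:
        if name not in fields:
            errors.append({"code": "unknown_field", "field": name,
                           "allowed": sorted(fields)})
    for name, expected in fields.items():
        if name not in args:
            errors.append({"code": "missing_field", "field": name,
                           "expected": expected.__name__})
        elif not isinstance(args[name], expected):
            errors.append({"code": "wrong_type", "field": name,
                           "expected": expected.__name__})
    return errors


def apply(deck: list[dict[str, Any]], command: str, args: dict[str, Any]) -> dict[str, Any]:
    """Validate first; touch the deck only when every check passes."""
    errors = validate(command, args)
    if errors:
        return {"ok": False, "errors": errors}
    if command == "add_slide":
        deck.append({"title": args["title"], "layout": args["layout"], "text": ""})
    elif command == "set_text":
        if not 0 <= args["slide"] < len(deck):
            return {"ok": False, "errors": [
                {"code": "out_of_range", "field": "slide", "slide_count": len(deck)}]}
        deck[args["slide"]]["text"] = args["text"]
    return {"ok": True, "slides": len(deck)}
```

a hallucinated field like `titel` comes back as an `unknown_field` error that lists the allowed names, and the deck stays unchanged, so the model can correct itself on the next call.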
The fun part of building agent-slides wasn't competing with billion-dollar companies.
It was proving that a well-crafted skill file + a CLI that treats agents as untrusted operators + open standards can match proprietary tooling.
I built an open-source agent skill that generates PPT decks from a single prompt.
Yes, Claude Cowork can do this natively now. I know. I did it anyway - because I wanted it to work in any agent, not just one.
Claude Code, Pi, Codex. Your pick.
👉 npx skills add https://t.co/d9DUKBej53
It is hard to communicate how much programming has changed due to AI in the last 2 months: not gradually and over time in the "progress as usual" way, but specifically this last December. There are a number of asterisks but imo coding agents basically didn’t work before December and basically work since - the models have significantly higher quality, long-term coherence and tenacity and they can power through large and long tasks, well past enough that it is extremely disruptive to the default programming workflow.
Just to give an example, over the weekend I was building a local video analysis dashboard for the cameras of my home so I wrote: “Here is the local IP and username/password of my DGX Spark. Log in, set up ssh keys, set up vLLM, download and bench Qwen3-VL, set up a server endpoint to inference videos, a basic web ui dashboard, test everything, set it up with systemd, record memory notes for yourself and write up a markdown report for me”. The agent went off for ~30 minutes, ran into multiple issues, researched solutions online, resolved them one by one, wrote the code, tested it, debugged it, set up the services, and came back with the report and it was just done. I didn’t touch anything. All of this could easily have been a weekend project just 3 months ago but today it’s something you kick off and forget about for 30 minutes.
As a result, programming is becoming unrecognizable. You’re not typing computer code into an editor like the way things were since computers were invented, that era is over. You're spinning up AI agents, giving them tasks *in English* and managing and reviewing their work in parallel. The biggest prize is in figuring out how you can keep ascending the layers of abstraction to set up long-running orchestrator Claws with all of the right tools, memory and instructions that productively manage multiple parallel Code instances for you. The leverage achievable via top tier "agentic engineering" feels very high right now.
It’s not perfect, it needs high-level direction, judgement, taste, oversight, iteration and hints and ideas. It works a lot better in some scenarios than others (e.g. especially for tasks that are well-specified and where you can verify/test functionality). The key is to build intuition to decompose the task just right to hand off the parts that work and help out around the edges. But imo, this is nowhere near "business as usual" time in software.
Voice AI turn taking is a solved problem.
The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually measured this.)
@mark_backman made a @pipecat_ai PR two weeks ago that was the last piece of the puzzle for turn taking so good that I no longer ever think about it.
The approach combines three layers of processing:
1. Voice activity detection, with a short (200ms) trigger.
2. A native audio turn detection model that's small, fast, and runs on CPU. This model captures audio nuances like inflection and filler sounds that don't get transcribed.
3. A prompt mixin for the conversation LLM that decides turn completion based on conversation context.
None of these are new. We've been using VAD for a long time. We trained the first version of the Pipecat Smart Turn native audio model in December 2024. And we've been experimenting with prompt-based large model turn detection (sometimes called "selective refusal") for more than a year.
Now, the Smart Turn model and the SOTA LLMs we're using in voice agents have both gotten so good that using them together feels like we've finally "solved" turn detection.
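One plausible way to chain the three layers, written as a plain cascade where the cheap checks run first. This is a sketch under assumptions, not how Pipecat wires them: the `TurnSignals` fields, the 0.5 probability cutoff, and the strict gating order are all hypothetical. Only the 200ms VAD trigger comes from the description above.

```python
from dataclasses import dataclass

VAD_TRIGGER_MS = 200            # short VAD trigger mentioned above
AUDIO_COMPLETE_THRESHOLD = 0.5  # assumed cutoff for the audio model's score


@dataclass
class TurnSignals:
    silence_ms: int             # layer 1: silence duration reported by VAD
    audio_complete_prob: float  # layer 2: hypothetical audio turn model score
    llm_says_complete: bool     # layer 3: verdict from the conversation LLM


def should_respond(s: TurnSignals) -> bool:
    """Respond only when all three layers agree the user's turn is over."""
    if s.silence_ms < VAD_TRIGGER_MS:
        return False  # user is probably still talking
    if s.audio_complete_prob < AUDIO_COMPLETE_THRESHOLD:
        return False  # inflection or filler suggests more is coming
    return s.llm_says_complete  # context has the final say
```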
Mark also figured out how to elegantly apply a "single-token tagging" technique to this problem. We sometimes use single-token tagging in place of tool calling, when we need a near-zero latency programmatic trigger. Mark's Pipecat mixin defines three single-token characters and prompts the LLM to output exactly one of them at the beginning of every response.
- ✓ means the agent should respond normally (immediately)
- ○ is a "short incomplete" - the agent should wait 5 seconds
- ◐ is a "long incomplete" - the agent should wait 10 seconds
The wait times, and the details of the prompt, are configurable, of course.
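Here is a rough sketch of the consuming side of single-token tagging: read the first character of the LLM response and map it to a wait time. The function name and the fallback for untagged output are assumptions for illustration, not the Pipecat mixin's actual API. The three tags and their default waits are the ones listed above.

```python
# Default waits in seconds, taken from the list above; configurable by the caller.
TAG_WAIT_SECONDS: dict[str, float] = {"✓": 0.0, "○": 5.0, "◐": 10.0}


def split_tag(response: str,
              waits: dict[str, float] = TAG_WAIT_SECONDS) -> tuple[float, str]:
    """Return (seconds to wait, response text with the leading tag removed)."""
    text = response.lstrip()
    if text and text[0] in waits:
        return waits[text[0]], text[1:].lstrip()
    # Assumption: an untagged response is treated as "respond normally".
    return 0.0, text
```

Because the tag is the very first character, the pipeline can branch as soon as the first streamed token arrives, without waiting for a full tool call to be generated and parsed.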
Watch the video to see me talk to an agent that handles all my various pauses and inflections, plus phrases like "let me think," pretty much the way a person would handle them, in terms of response latency.
Also, in the second half of the video, I ask the agent to adjust its response pattern because I'm going to tell it a phone number. This kind of "in-context" adjustment of response wait times is really useful.
The LLM in the video is GPT-4.1. We've tested the prompt and single-token adherence with GPT-4.1, Gemini 2.5 Flash, Anthropic Claude Sonnet 4.5, and AWS Nova 2 Pro. Note that older models in all these families (and, in general, smaller open weights models) aren't able to reliably output these single-token tags. But the new models we're using these days are pretty amazing.
Interesting read: Assume no technical ceiling. Compute gets cheaper. RL scales. Reasoning deepens. The harder question might be economic: does cost fall faster than performance rises?
I still see a lot of people discussing LLMs as next-token predictors, which is by now quite a misunderstanding. A related opinion is that LLM progress will probably plateau. This post explains why I don't think the "plateau" argument holds up. https://t.co/fJPBoWs2aX
Frontier work is tempting. Labs, tiny teams, massive compute. I get it.
But I keep coming back to the boring layer: integration, reliability, compliance, workflow capture. The stuff that makes AI usable in real companies.
AI power will keep improving.
The moat won’t be access. It’ll be who makes it work.