Another day, another notable release by the @googlegemma team: the QAT checkpoints give you compressed models with almost no accuracy loss 🤯
We just dropped Gemma 4 Quantization-Aware Training (QAT) checkpoints on Hugging Face!
All Gemma 4 model sizes and their drafters are now optimized with QAT to cut memory requirements and maximize on-device performance!
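If you are new to QAT, the core trick is the quantize-dequantize ("fake quant") step that quantization-aware training typically inserts into the forward pass, so the model learns weights that survive rounding. The sketch below assumes a symmetric per-tensor int4 scheme and a hypothetical `fake_quantize` helper; it is not Gemma 4's actual QAT recipe.

```python
def fake_quantize(weights, bits=4):
    """Quantize to signed `bits`-bit integers, then map back to floats.

    In QAT this runs inside the forward pass so the loss sees the rounding
    error; the backward pass usually treats round() as identity (a
    straight-through estimator), which is not shown here.
    """
    qmax = 2 ** (bits - 1) - 1                             # 7 for signed int4
    qmin = -qmax - 1                                       # -8 for signed int4
    scale = (max(abs(w) for w in weights) / qmax) or 1.0   # avoid a zero scale
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]


w = [0.70, -0.35, 0.12, -0.01]
print(fake_quantize(w))
```

Every weight lands on one of at most 16 levels, and the per-weight error stays within half a quantization step.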
What a cool model to run locally. I just found my favorite new coding companion and will have a lot of fun with the Collider app: https://t.co/l3vIF9Ix2C
Introducing Magenta RealTime 2, a new open model musicians can play as an instrument!
Run low-latency, live music synthesis natively on your MacBook using MIDI, text, and audio. 🎶
We love seeing Google’s open model ecosystem grow!
Gemma 4 12B is a great addition to the Gemma 4 family, especially if you want to run multimodal agents locally.
What makes it stand out is its encoder-free architecture. Instead of separate vision and audio encoders adding latency, raw signals project directly into the LLM backbone.
This means native, low-latency multimodal reasoning on an everyday 16GB laptop.
Learn more: https://t.co/xtVlh9Xi97
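Here is a toy sketch of what "raw signals project directly into the LLM backbone" can look like. Raw samples are cut into fixed-size patches, and a single linear map turns each patch into a vector as wide as the text embeddings, so both share one sequence. The sizes, the random weights, and the helper names (`patchify`, `project`) are hypothetical. This is not Gemma 4 12B's actual design.

```python
import random

D_MODEL = 8  # hypothetical hidden size of the LLM backbone
PATCH = 4    # hypothetical number of raw samples per patch


def patchify(signal, patch=PATCH):
    """Split a 1-D raw signal into fixed-size patches, zero-padding the tail."""
    padded = signal + [0.0] * (-len(signal) % patch)
    return [padded[i:i + patch] for i in range(0, len(padded), patch)]


def linear(x, weight):
    """Multiply vector x (length in_dim) by weight (in_dim rows, out_dim cols)."""
    return [sum(xi * row[j] for xi, row in zip(x, weight)) for j in range(len(weight[0]))]


def project(signal, weight):
    """Map each raw patch straight into the backbone's embedding space (no encoder)."""
    return [linear(p, weight) for p in patchify(signal)]


rng = random.Random(0)
W = [[rng.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(PATCH)]
text_embeddings = [[0.0] * D_MODEL for _ in range(3)]  # stand-in for 3 text tokens
audio = [rng.uniform(-1, 1) for _ in range(10)]        # 10 raw samples -> 3 patches

sequence = text_embeddings + project(audio, W)         # one interleaved sequence
print(len(sequence), len(sequence[0]))
```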
Today we’re introducing Gemma 4 12B — our latest open model that brings advanced agentic reasoning, vision and audio directly to your laptop.
It delivers performance approaching that of our larger Gemma models with a much smaller total memory footprint, small enough to run locally with just 16GB of VRAM. It’s open and accessible for everyone to use under a permissive Apache 2.0 license.
This is all made possible by our new, unified architecture that removes separate multimodal encoders. Here’s how we did it 🧵
We’ve rolled out a new version of Gemini 3.5 Flash in Antigravity that boasts much less and has higher endurance on harder tasks. Thanks for all of the feedback on the model. Keep it coming: we will act quickly across the stack to make the experience even better.
We’ve also gone ahead and reset Gemini rate limits for all users so you can start running this new model immediately.
I enjoy seeing new benchmarks like DeepSWE and ProgramBench hit the spotlight as models become more capable. Previous benchmarks are getting saturated and it's harder to meaningfully compare what these systems can actually do.
What I especially appreciate: cost is now an axis. That matches a builder's reality far more closely than most evaluations out there.
https://t.co/Xm7Rh4dzZB
https://t.co/bIe0Xhipjr
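One way to treat cost as an axis is to keep only the models on the score/cost Pareto frontier. These are the models that no other model beats on both score and price at once. The model names and numbers below are made up for illustration. They are not DeepSWE or ProgramBench results.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    score: float  # benchmark score, higher is better
    cost: float   # cost per task in USD, lower is better


def pareto_frontier(results):
    """Return results not dominated on (higher score, lower cost), cheapest first."""
    frontier = []
    for r in results:
        dominated = any(
            o.score >= r.score and o.cost <= r.cost and (o.score > r.score or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost)


# Hypothetical numbers for illustration only.
runs = [
    Result("model-a", 62.0, 4.10),
    Result("model-b", 58.0, 0.90),
    Result("model-c", 55.0, 1.20),  # dominated by model-b: lower score, higher cost
    Result("model-d", 40.0, 0.15),
]
for r in pareto_frontier(runs):
    print(r.model, r.score, r.cost)
```

A leaderboard sorted by score alone would rank model-c above model-d, even though model-c is never the right pick once price matters.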
The team at @cursor_ai released their Developer Habits Report showing the massive shift in how software is built. According to the report, AI isn't leveling the playing field - it's widening it.
Here are my 3 takeaways for engineering leaders:
1. P99 power users are producing 46x more lines of code and merging 15x more PRs than the median.
AI productivity is highly concentrated. Rollout is easy; scaling the habits, workflows, and prompt patterns of your top 1% is where the actual value lies.
2. Context is the new compiler. The input-to-output token ratio is spiking, and cache-reads now account for ~90% of token activity.
Clean codebase architecture and robust workspace indexing are now direct drivers of model output quality. Spaghetti code = bad AI results.
3. Trust is shifting to automation. Over 36% of agent-generated changes are now accepted and committed without manual review.
The bottleneck has officially moved from writing code to validating it. Without automated testing and security guardrails, agentic throughput will stall.
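To make items 1 and 2 concrete, here is a sketch of how these ratios could be computed from per-request usage logs, plus the P99-to-median comparison. The record keys and the exact definitions are assumptions, and the report may define them differently. In particular, this sketch counts cache reads as input and takes the cache-read share over all tokens.

```python
import statistics


def usage_ratios(records):
    """Aggregate per-request token usage into two headline ratios.

    Each record is a dict with hypothetical keys:
      'input'      - fresh (uncached) prompt tokens
      'cache_read' - prompt tokens served from the provider's prompt cache
      'output'     - generated tokens
    """
    fresh = sum(r["input"] for r in records)
    cached = sum(r["cache_read"] for r in records)
    generated = sum(r["output"] for r in records)
    return {
        # All prompt tokens (fresh + cached) per generated token.
        "input_to_output": (fresh + cached) / generated,
        # Cache reads as a share of all token activity.
        "cache_read_share": cached / (fresh + cached + generated),
    }


def p99_to_median(values):
    """How far the 99th percentile sits above the median (needs >= 2 values)."""
    p99 = statistics.quantiles(values, n=100)[98]  # 99 cut points; index 98 is P99
    return p99 / statistics.median(values)


logs = [
    {"input": 1_500, "cache_read": 40_000, "output": 2_500},
    {"input": 500, "cache_read": 5_000, "output": 500},
]
print(usage_ratios(logs))
```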
> We are moving from "copilots" helping individuals to agents acting as development infrastructure.
The challenge now isn’t the quality of the raw model—it’s the quality of the system you build around it.
https://t.co/sQQI8YoJjk
🔬 It's always fun to take new open models for a spin — StepFun's Step 3.7 Flash (MoE) dropped today, so I ran the NVFP4 variant on my 2x DGX Spark setup.
First impressions:
→ Prefill throughput is solid (~2.7-3K t/s)
→ Decode is on the slower side (~21-42 t/s depending on concurrency)
→ The NVFP4 variant doesn't ship MTP-layer weights, so MTP-based speculative decoding isn't an option. That's a miss
→ KV cache is memory-hungry
Tool-calling quality scored a perfect 100/100 on tool-eval-bench — all 15 scenarios passed. But responsiveness landed at 30/100 with a 5.3s median turn time.
The pattern is interesting: high quality output, but the latency cost is real. Usable for daily experiments, but this quant doesn't quite compete with faster options for interactive use.
Release: https://t.co/URnYBV9i0F
NVFP4 variant: https://t.co/xMq35Qh99n
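A back-of-envelope latency model helps explain how a perfect quality score can still come with a 5.3s median turn. At ~21 t/s, generating even 100 tokens takes close to 5 seconds. The prompt and reply sizes below are hypothetical. The model also ignores queuing, tool execution, and network time, so it is not how tool-eval-bench measures responsiveness.

```python
def turn_time_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough single-turn latency: prefill time plus decode time.

    Ignores queuing, tool execution, and network overhead.
    """
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical turn: 3,000 prompt tokens and 100 generated tokens, using the
# throughput ranges reported above (prefill ~2.7-3K t/s, decode ~21-42 t/s).
slow = turn_time_s(3_000, 100, prefill_tps=3_000, decode_tps=21)
fast = turn_time_s(3_000, 100, prefill_tps=3_000, decode_tps=42)
print(f"{slow:.1f}s vs {fast:.1f}s")
```

With these numbers, prefill costs about a second and decode accounts for the rest. That is why decode speed, not prefill, dominates interactive feel for short tool-calling turns.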
Typically vision-language models decode bounding boxes the same way they decode text: one coordinate token at a time. x1, then y1, then x2, then y2. Sequentially. It works, but it's slow, and each coordinate is committed before the later ones exist, so the box is never predicted as a coherent whole.
NVIDIA's LocateAnything-3B takes a different approach: Parallel Box Decoding. Each bounding box is predicted atomically in a single forward pass. The result is significantly faster decoding throughput and better localization accuracy — because the coordinates are geometrically coherent by design, not by luck.
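Below is a toy contrast between the two decoding styles that counts forward passes per box. The `next_coord` and `box_head` callables stand in for a model. The centre-plus-log-size parameterization is one assumed way to make corners coherent by construction; it is not taken from the LocateAnything paper.

```python
import math
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def decode_sequential(next_coord: Callable[[List[float]], float]) -> Tuple[Box, int]:
    """Autoregressive: one forward pass per coordinate, each seeing only earlier ones."""
    coords: List[float] = []
    passes = 0
    for _ in range(4):  # x1, then y1, then x2, then y2
        coords.append(next_coord(coords))
        passes += 1
    return (coords[0], coords[1], coords[2], coords[3]), passes


def decode_parallel(box_head: Callable[[], Tuple[float, float, float, float]]) -> Tuple[Box, int]:
    """Parallel: a single forward pass predicts the whole box at once.

    Assumed parameterization: centre (cx, cy) plus log-width/log-height, so the
    decoded corners always satisfy x2 >= x1 and y2 >= y1.
    """
    cx, cy, log_w, log_h = box_head()
    w, h = math.exp(log_w), math.exp(log_h)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2), 1


# Toy stand-ins for a model's forward pass (hypothetical values).
target = [0.10, 0.20, 0.50, 0.60]
seq_box, seq_passes = decode_sequential(lambda prefix: target[len(prefix)])
par_box, par_passes = decode_parallel(lambda: (0.30, 0.40, math.log(0.40), math.log(0.40)))
print(seq_passes, par_passes)
```

Both paths recover the same box, but the parallel path needs one pass instead of four, and no output of the box head can yield an inverted box.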
What makes it interesting for you? It's a single 3B-parameter model (built on Qwen2.5-3B) that handles document understanding, GUI grounding, dense object detection, and OCR localization under one unified architecture. Small enough to run locally, capable enough to be useful.
There's a live demo on Hugging Face if you want to try it before reading the paper.
🤗 https://t.co/6UErIWqwIH
📄 https://t.co/V41HqhK9W2
#AI #ComputerVision #ObjectDetection
Most tool-calling benchmarks test models in ideal conditions — clean context, well-formed payloads, single-turn. That's not how agents work in production.
I built tool-eval-bench to find out what actually breaks. 74 deterministic scenarios testing multi-turn chains, safety boundaries, structured output, and error recovery — against any OpenAI-compatible endpoint (vLLM, llama.cpp, LiteLLM). Mocks inject realistic noise (extra metadata, timestamps, nested objects) because real APIs are messy.
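Here is a minimal sketch of seeded noise injection for mock tool responses, in the spirit of the mocks described above. The field names (`_meta`, `trace`, `fetched_at`) and the `add_noise` helper are hypothetical, not tool-eval-bench's code. The point is that a fixed seed keeps the "messy" payload identical across runs, so scenarios stay deterministic.

```python
import copy
import random
from datetime import datetime, timedelta, timezone


def add_noise(payload: dict, seed: int) -> dict:
    """Return a noisy copy of a clean mock tool payload; the input is not modified."""
    rng = random.Random(seed)  # same seed -> identical noise on every run
    noisy = copy.deepcopy(payload)
    base = datetime(2025, 1, 1, tzinfo=timezone.utc)
    # Extra metadata the model should ignore.
    noisy["_meta"] = {
        "request_id": f"req-{rng.randrange(10**8):08d}",
        "latency_ms": rng.randint(5, 900),
        # A nested object, since real APIs rarely return flat JSON.
        "trace": {"spans": [{"name": "db", "ms": rng.randint(1, 50)}]},
    }
    # A timestamp field unrelated to the task.
    noisy["fetched_at"] = (base + timedelta(seconds=rng.randrange(86_400))).isoformat()
    return noisy


clean = {"city": "Oslo", "temp_c": 4}
print(add_noise(clean, seed=7))
```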
The feature I keep coming back to: --context-pressure. It pre-fills your context window before each scenario to simulate real agentic load. In my testing, most models hold up fine through 50% pressure. Past 75%, tool selection degrades, parameters get hallucinated from earlier context, and multi-turn chains collapse. The breaking point depends as much on your KV cache config as on the model itself.
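Here is a rough sketch of how a context-pressure prefix could be sized. It fills a given fraction of the window with earlier-looking turns while keeping room for the scenario itself. The `filler_budget` and `filler_messages` names, the `reserve` parameter, and the ~4 characters-per-token heuristic are all assumptions. This is not how `--context-pressure` is actually implemented.

```python
import math


def filler_budget(context_window: int, pressure: float, reserve: int) -> int:
    """Filler tokens to pre-load before a scenario.

    `pressure` is the fraction of the window to occupy (0.0-1.0); `reserve`
    is the room kept free for the scenario and the model's reply.
    """
    if not 0.0 <= pressure <= 1.0:
        raise ValueError("pressure must be between 0 and 1")
    target = math.floor(context_window * pressure)
    return max(0, min(target, context_window - reserve))


def filler_messages(n_tokens: int, tokens_per_turn: int = 500) -> list:
    """Split the budget into fake prior turns, assuming ~4 characters per token."""
    messages = []
    remaining, turn = n_tokens, 0
    while remaining > 0:
        size = min(tokens_per_turn, remaining)
        role = "user" if turn % 2 == 0 else "assistant"
        # "lorem " is 6 characters, so this is roughly size * 4 characters.
        messages.append({"role": role, "content": f"Earlier note {turn}: " + "lorem " * (size * 4 // 6)})
        remaining -= size
        turn += 1
    return messages


budget = filler_budget(32_768, 0.75, reserve=4_096)
print(budget, len(filler_messages(budget)))
```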
Also includes --spec-live for a live terminal view of speculative decoding acceptance rates, and integrates with llama-benchy for prefill/decode throughput sweeps.
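The numbers a live speculative-decoding view tracks can be sketched as simple counters. The acceptance rate is accepted draft tokens divided by drafted tokens. The `SpecStats` class is hypothetical. The tokens-per-step figure assumes the standard scheme, in which every verification step also yields one token from the target model itself.

```python
from dataclasses import dataclass


@dataclass
class SpecStats:
    drafted: int = 0   # tokens proposed by the draft model or MTP head
    accepted: int = 0  # drafted tokens the target model verified and kept
    steps: int = 0     # verification steps (target-model forward passes)

    def record(self, drafted: int, accepted: int) -> None:
        if not 0 <= accepted <= drafted:
            raise ValueError("accepted must be between 0 and drafted")
        self.drafted += drafted
        self.accepted += accepted
        self.steps += 1

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.drafted if self.drafted else 0.0

    @property
    def tokens_per_step(self) -> float:
        # Accepted draft tokens plus the one token the target emits per step.
        return (self.accepted + self.steps) / self.steps if self.steps else 0.0


s = SpecStats()
s.record(drafted=4, accepted=3)
s.record(drafted=4, accepted=1)
print(s.acceptance_rate, s.tokens_per_step)
```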
Heavily inspired by @stevibe's BenchLocal — I wanted to extend that foundation with multi-turn edge cases, structured output schemas, and pressure testing under load.
https://t.co/dgZN8uqygi
#AgenticAI #LLMs #vLLM
Antigravity CLI 1.0.3 is just out!
Now you can use Google AI credits when your quota runs out.
- /config -> UseF1Credits to turn it on.
- /credits to check your balance.
- Enhanced logo on Apple Terminal and more informative color-scheme preview panels.
- Improved /diff experience and various critical fixes.
Getting started: https://t.co/EfaMSLXLX1
💡 The most underrated AI coding technique isn't writing code faster. It's writing better code more slowly.
👉 The insight from Nolan Lawson's approach: run multiple models on every PR, cross-validate their findings, fix what's real. Near-zero false positives.
💪 That's not vibe coding. That's engineering discipline.
https://t.co/xXfJolmstp
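The cross-validation step can be sketched as agreement voting: keep a finding only when at least two models report it independently. The `(file, line, category)` schema and exact-match voting are simplifying assumptions, not Nolan Lawson's actual tooling. Real reviews need fuzzier matching, and deciding what's real still takes a human.

```python
from collections import Counter
from typing import Dict, List, Tuple

Finding = Tuple[str, int, str]  # (file, line, category); hypothetical schema


def cross_validate(reports: Dict[str, List[Finding]], min_agree: int = 2) -> List[Finding]:
    """Keep only findings reported by at least `min_agree` different models."""
    votes: Counter = Counter()
    for findings in reports.values():
        votes.update(set(findings))  # one vote per model, even if it repeats itself
    return sorted(f for f, n in votes.items() if n >= min_agree)


reports = {
    "model-a": [("app.py", 42, "null-deref"), ("db.py", 7, "sql-injection")],
    "model-b": [("app.py", 42, "null-deref")],
    "model-c": [("app.py", 42, "null-deref"), ("ui.ts", 3, "style")],
}
print(cross_validate(reports))
```

Raising `min_agree` trades recall for precision. Single-model findings are not discarded outright; they simply go to a lower-priority queue for manual checking.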
Theo hits a lot of points that ring true in my own workflow.
→ Learn how to best interact with the different models
→ Get a remote coding setup (no half-open laptop 😅)
→ Play around with the style of your AGENTS.md file
→ Create a gold standard spec for reference
Take the time to teach your agent how you like to work, with proper context and references. You'll notice the difference over time 💡
Antigravity CLI 1.0.1 is out.
Key updates:
- Fixed OAuth not persisting in some environments.
- Enhanced the visual experience on Windows.
- Added the new "proceed in sandbox" permission control.
Restart agy to auto-update, or run "agy update".
See the full changelog for details:
https://t.co/zJNgkJck3Z
We just tripled the rate limits across all tiers in Antigravity so you can put 3.5 Flash through its paces even more. Enjoy, and keep the feedback coming! :)