If I had infinite free compute, I would train a brand-new model on top of an OSS model like Moshi. I’m building Duolingo for speaking and keep fighting the model’s baked-in assistant reflexes, like “How can I help you” or “if you want, I can do X”. A new model could give me truly natural conversational behavior, unlike an assistant…
Can anyone explain to me why, in tennis, a slice serve bounces left from the server’s perspective and a kick serve bounces right from the server’s perspective, even though their rotations differ by only 30 degrees… Gemini couldn’t answer physically and mathematically…
I just downgraded to the $100 plan for now (we can switch back anytime) and am test-driving both of them… So far, from what I've observed, Codex tends to stop often.
When I ask why A is B, Claude flips its opinion, says it should have been C, and continues. Codex, by contrast, just answers my question and pauses with “if you want…”, even when it admits C was better.
The original quality-bar problem was developer discipline. Some run lint and type-check before pushing, some don't, the team's bar drifts to the slowest path. Pre-commit fixed it: gates in code, not in habits. AI agents reopened the same problem at a different layer. GitAuto closes it the same way.
The agent layer is where the model writes code without going through any gate by default. Prompting the model to remember to lint is the equivalent of asking a junior developer to remember. Sometimes works, often doesn't, and "often doesn't" is the variance pre-commit was invented to remove.
The fix is structure, not discipline. After every code-writing turn the agent takes, the orchestrator that drives it runs phpcbf to autoformat, phpcs to surface remaining lint, and phpstan for static analysis. The model has no say in whether they fire. They fire because the runner calls them, the same way pre-commit fires because git calls it.
The agent layer has one extra property the human-commit layer doesn't: the loop has another turn after the gate. So when the formatter can autofix, the orchestrator commits the fix and the agent keeps going. When the analyzer reports something unfixable, the orchestrator hands the errors back to the agent as new instructions for the next turn.
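A minimal sketch of that post-turn step, not the actual orchestrator. `GateResult`, `after_agent_turn`, `commit`, and `next_instructions` are hypothetical names; each gate is a plain callable so that a real one could shell out to phpcbf, phpcs, or phpstan.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GateResult:
    name: str          # e.g. "phpcbf", "phpcs", "phpstan"
    fixed: bool        # True if the gate rewrote files (autofix)
    errors: List[str]  # issues the gate could not fix by itself


def run_gates(gates: List[Callable[[], GateResult]]) -> Tuple[bool, List[str]]:
    """Run every gate unconditionally; the model is never asked whether to."""
    any_fixed = False
    errors: List[str] = []
    for gate in gates:
        result = gate()
        any_fixed = any_fixed or result.fixed
        errors.extend(f"[{result.name}] {e}" for e in result.errors)
    return any_fixed, errors


def after_agent_turn(
    gates: List[Callable[[], GateResult]],
    commit: Callable[[str], None],
    next_instructions: Callable[[str], None],
) -> Tuple[bool, List[str]]:
    """Hypothetical hook the runner calls after each code-writing turn."""
    any_fixed, errors = run_gates(gates)
    if any_fixed:
        # Autofixable output is committed by the orchestrator, not the model.
        commit("Apply autoformatter fixes")
    if errors:
        # Unfixable output becomes the agent's instructions for the next turn.
        next_instructions("Fix these issues:\n" + "\n".join(errors))
    return any_fixed, errors
```

The point of the shape is that nothing here is reachable from the prompt: the gates run because `after_agent_turn` is called by the loop.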
To stop the strict gate from exploding the diff with sweeps across files the change didn't touch, the orchestrator captures a baseline at the start of the agent's work and only blocks on the delta. Pre-existing issues stay pre-existing.
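One way to compute that delta, as a sketch under assumptions: issues are keyed by `(file, rule, message)` without line numbers, since the agent's edits shift lines. The `Issue` shape and `new_issues` are illustrative, not the orchestrator's real data model.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed issue key: (file, rule, message). Line numbers are left out on
# purpose because edits move them around.
Issue = Tuple[str, str, str]


def new_issues(baseline: Iterable[Issue], current: Iterable[Issue]) -> List[Issue]:
    """Return only the issues that were not already present in the baseline."""
    remaining = Counter(baseline)
    delta: List[Issue] = []
    for issue in current:
        if remaining[issue] > 0:
            remaining[issue] -= 1  # pre-existing occurrence, do not block on it
        else:
            delta.append(issue)    # introduced during the agent's work
    return delta
```

Counting occurrences rather than using a plain set means a second copy of an already-known issue in the same file still shows up as new.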
Reusable rule for anyone integrating an LLM into a coding workflow: anywhere you've thought "the model should remember to run X," you actually want X invoked unconditionally by the code around the model. Same logic that put X in your pre-commit hook in the first place.
100% line coverage can still mean garbage tests. A test that calls a function and never asserts the return value hits every line but proves nothing.
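A small illustration of the gap, with made-up function names. `weak_test` executes every line of the function under test yet passes against a broken implementation; `strong_test` covers the same lines and catches it.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100


def broken_discount(price_cents: int, percent: int) -> int:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * percent // 100  # bug: returns the discount itself


def weak_test(fn) -> None:
    # Hits the happy path and the error branch, so line coverage is 100%,
    # but nothing is asserted about the results.
    fn(1000, 20)
    try:
        fn(1000, 150)
    except ValueError:
        pass


def strong_test(fn) -> None:
    # Same lines covered, but the return value and the error are checked.
    assert fn(1000, 20) == 800
    try:
        fn(1000, 150)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for percent > 100")
```

Both `weak_test(apply_discount)` and `weak_test(broken_discount)` pass, which is exactly the problem; only `strong_test` tells the two apart.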
We built a 44-check quality evaluator that runs after coverage passes. 9 categories: integration, business logic, adversarial, security, performance, memory, error handling, accessibility, SEO.
The key design choice: each check can be "not applicable." A CLI tool has no accessibility concerns. A math function has no SQL injection surface. Forcing every check on every file just teaches developers to game the system.
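A sketch of how "not applicable" can be a first-class verdict rather than a forced pass or fail. The `Verdict`, `CheckResult`, and `summarize` names are hypothetical and not taken from the published checklist.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_APPLICABLE = "n/a"


@dataclass
class CheckResult:
    category: str  # e.g. "security", "accessibility"
    name: str
    verdict: Verdict


def summarize(results: List[CheckResult]) -> Dict[str, object]:
    """Score only the checks that apply to this file."""
    applicable = [r for r in results if r.verdict is not Verdict.NOT_APPLICABLE]
    failed = [r for r in applicable if r.verdict is Verdict.FAIL]
    return {
        "applicable": len(applicable),
        "passed": len(applicable) - len(failed),
        "failed": len(failed),
        "ok": not failed,
    }
```

With this shape, a math helper marked `NOT_APPLICABLE` for SQL injection neither inflates nor drags down its score.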
Published the full checklist so anyone can see what gets evaluated and why.
Every AI agent provider has the same problem: premium models cost too much to offer a real free tier. 3 free PRs at $8 each is a demo, not a trial.
Google Gemma 4 31B changed the math. Near-zero API cost, thin routing layer, same agent loop. $2/PR, 12 free PRs instead of 3.
A few caveats: tool use is less reliable than Opus, it errors on consecutive same-role messages (need message merging), and it tries to finish early with shorter outputs. That last one is solvable at the agent layer: check the result, give feedback, make it iterate.
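The message merging can be as small as this. It is a sketch that assumes each message is a dict with a `role` and a plain-string `content`; structured content blocks would need list concatenation instead.

```python
from typing import Dict, List


def merge_consecutive_roles(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Collapse runs of same-role messages into one message per run."""
    merged: List[Dict[str, str]] = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            # Same role as the previous message: append instead of adding a turn.
            merged[-1]["content"] += "\n\n" + msg["content"]
        else:
            merged.append(dict(msg))  # copy so the caller's history is untouched
    return merged
```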
I also switched our own repo to Gemma for dogfooding. Used to run Opus on it, which added up fast.
Worth considering if you're building agents and struggling with free-tier economics.
The agent edited a customer's `.circleci/config.yml` to fix a bug that was inside our own AWS Lambda.
Mongo's in-memory binary needed an OpenSSL library our Lambda OS didn't ship. Validation crashed with "library missing". The agent read that, concluded the customer's CircleCI was misconfigured, and added `MONGOMS_DISTRO=ubuntu-22.04` to their config. Customer CI then crashed with a different missing library. Three empty retry commits later the agent gave up.
Each step in the agent's reasoning was defensible. The chain failed because nothing in the input said which runtime produced the error. "Mongo" and "libcrypto" look customer-flavored if you don't know they came from your own infrastructure.
Every log handed to the agent now carries `[log source: OUR ...]` or `[log source: CUSTOMER ...]`. Detected at call time, not hardcoded. The trigger prompt tells the agent: OUR-tagged means keep doing the customer's task, do not write workarounds into their code.
Smaller move than fixing the binary mismatch itself. Bigger long-term payoff: the next runtime mismatch we don't anticipate gets refused on its own.
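A sketch of the tagging step. The post elides what follows `OUR` and `CUSTOMER` in the real tag, so the `detail` argument here is a placeholder, and `tag_log` and `produced_by_us` are hypothetical names. The caller decides ownership at call time because it knows which runtime it just invoked.

```python
def tag_log(log_text: str, produced_by_us: bool, detail: str) -> str:
    """Prefix a log with its origin before it is handed to the agent.

    `detail` is an assumed free-text field (for example the runtime name);
    the exact wording of the real tag is not given in the post.
    """
    owner = "OUR" if produced_by_us else "CUSTOMER"
    return f"[log source: {owner} {detail}]\n{log_text}"
```

The same "library missing" text then reads differently to the agent depending on the first line, which is the missing input that the original failure chain lacked.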
If your agent is looping on the same tool call, check what your agent loop is pruning.
A task on a customer repo had Claude Opus 4.7 call the same tool with the same args 17 times in a row. About half the per-task budget gone on duplicate work.
The tell in the logs: every one of those repeat calls had no thinking text in the response. Just a bare tool call with no rationale.
Traced it. When the agent re-emitted a duplicate, my agent loop pruned the prior turn from history. Reasonable optimization. But the prune removed the entire assistant turn, including the text where the model had written its plan.
Concrete example. The agent's turn looks like this when it emits a tool call:
```
{"role": "assistant", "content": [
{"type": "text", "text": "Plan: read the file, find the empty section, append three blocks."},
{"type": "tool_use", "id": "tool_42", "input": {...}}
]}
```
Old prune removed the whole message when the file was re-fetched later. New prune strips only the tool call:
```
{"role": "assistant", "content": [
{"type": "text", "text": "Plan: read the file, find the empty section, append three blocks."}
]}
```
So with the old behavior the agent would: write a plan, call a tool, see the result. Next turn the prune wiped the prior plan along with the duplicate. Model now had clean tool history and no idea what it was after. Tries the same call again. And again.
Fix is one if-statement. Prune the duplicate's tool call and result. Keep the reasoning text. Plan stays, loop stops.
Lesson for anyone building agent loops: when you prune duplicate tool calls from conversation history, be specific about what you take. Tool call and tool result, yes. The reasoning the model wrote in the same turn, no.
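A sketch of the narrower prune, assuming the message shape shown above plus a matching `tool_result` block that carries a `tool_use_id`. `prune_tool_call` is a hypothetical name, not the actual loop code.

```python
from typing import Any, Dict, List


def prune_tool_call(messages: List[Dict[str, Any]], tool_use_id: str) -> List[Dict[str, Any]]:
    """Drop one tool call and its result, but keep any text from the same turn."""
    pruned: List[Dict[str, Any]] = []
    for msg in messages:
        content = msg["content"]
        if isinstance(content, str):
            pruned.append(msg)  # plain-text message, nothing to strip
            continue
        kept = [
            block for block in content
            if not (
                (block.get("type") == "tool_use" and block.get("id") == tool_use_id)
                or (block.get("type") == "tool_result"
                    and block.get("tool_use_id") == tool_use_id)
            )
        ]
        if kept:
            # The plan text survives even though the tool call is gone.
            pruned.append({**msg, "content": kept})
        # A message left with no blocks at all is dropped entirely.
    return pruned
```

Dropping an emptied message can leave two same-role messages next to each other, so a real loop may need to merge or re-validate the history afterwards.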
Building agent products on a fixed per-task price means your LLM cost is variable but your revenue is fixed. Hard tasks regularly cost more than they earn. The model provider doesn't compensate, so I covered the overruns out of pocket.
Then I tried capping spend and halting when the cap is hit. Cleaner on the books, but the task is abandoned at whatever partial state it had reached. Bad trade.
What works for us now: cap the spend on each task, swap from the premium model to a free-tier OSS model when the cap hits, keep full conversation history. The smaller brain picks up where the bigger one left off. No further charges accrue, the task usually finishes.
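Roughly, the cap-and-swap loop looks like this. It is a sketch with assumed shapes: a model is any callable that takes the history and returns `(reply, cost_usd, done)`, and `run_task` and the `"Continue."` nudge are illustrative only.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
# Assumed model interface: history in, (reply_text, cost_usd, done) out.
Model = Callable[[List[Message]], Tuple[str, float, bool]]


def run_task(
    task: str,
    premium: Model,
    fallback: Model,
    cap_usd: float,
    max_turns: int = 50,
) -> Tuple[List[Message], float, List[str]]:
    history: List[Message] = [{"role": "user", "content": task}]
    spent = 0.0
    used: List[str] = []
    for _ in range(max_turns):
        # The cap is checked before each turn, so spend can overshoot it
        # by at most one premium turn.
        on_premium = spent < cap_usd
        model = premium if on_premium else fallback
        # Both models see the same, full history: nothing is reset on the swap.
        reply, cost, done = model(history)
        spent += cost
        used.append("premium" if on_premium else "fallback")
        history.append({"role": "assistant", "content": reply})
        if done:
            break
        history.append({"role": "user", "content": "Continue."})
    return history, spent, used
```

Because the swap only changes which callable is invoked, the fallback model inherits every plan and tool result the premium model produced.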
Worth trying if you're building agents and watching margin walk negative on long tasks.