MCP fixed integration, but it did not fix reliability.
The protocol standardizes how the agent reaches the tool. It does not standardize how the tool behaves once reached.
Reliability sits on top of the connector layer, not inside it.
A failed tool-calling agent rarely fails at the model. It fails in the span: the individual step or tool call recorded in the trace.
Four checks that find the cause faster than any model debugging:
1. Look at the step, not the output.
2. Check the tool contract.
3. Replay the state at the failed step.
4. Look at retries.
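A minimal sketch of the first and fourth checks, assuming each step of a run is already recorded in a trace. The `Step` fields and the `first_suspect_step` helper are hypothetical names used for illustration, not part of any particular tracing library.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Step:
    name: str             # hypothetical label for the step
    tool: Optional[str]   # tool called at this step, if any
    output: Any           # what the step returned
    retries: int = 0      # how many times the step was retried


def first_suspect_step(trace: List[Step]) -> Optional[Step]:
    """Return the earliest step that returned nothing or needed retries."""
    for step in trace:
        if step.output is None or step.retries > 0:
            return step
    return None


trace = [
    Step("plan", None, "look up order"),
    Step("fetch_order", "orders.get", None, retries=2),
    Step("answer", None, "Your order shipped."),
]

suspect = first_suspect_step(trace)
print(suspect.name if suspect else "no suspect step")
```

The final output here looks fine on its own. The step-level view is what shows that the answer was produced after a tool call that returned nothing.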
Procedural memory is the most dangerous kind of stale memory in production agents because the agent has no idea it’s wrong.
It stores how the agent does things: which tool to call, in what order, with what parameters.
The three memory types fail differently:
• Session memory: The agent re-asks for something the user already said. The user notices immediately.
• Entity memory: The agent maps a user to an outdated profile. Recommendations fall off. Trust erodes slowly.
• Procedural memory: The agent executes a learned workflow against a tool that changed its API. No error gets raised. The output looks plausible.
The first two failures surface eventually. Someone notices and escalates.
Procedural failures don’t get escalated; they get attributed to “the agent being a bit off.” The staleness compounds until the workflow is consistently wrong and nobody knows why.
This is why procedural memory needs the shortest refresh cycle of any scope. Not because it changes most often, but because when it’s wrong, nothing in the system will tell you.
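One way to make a stale procedure raise a signal is to store a fingerprint of each tool's schema next to the learned workflow and compare it before reuse. The sketch below assumes tool schemas are available as plain dictionaries. The `procedure` layout and both function names are hypothetical.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Hash a tool schema in a key-order-independent way."""
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_stale(procedure: dict, current_schemas: dict) -> bool:
    """True if any tool the procedure relies on is missing or has changed."""
    for tool, recorded in procedure["fingerprints"].items():
        schema = current_schemas.get(tool)
        if schema is None or schema_fingerprint(schema) != recorded:
            return True
    return False


old_schema = {"name": "create_ticket", "params": {"title": "string"}}
new_schema = {"name": "create_ticket", "params": {"summary": "string"}}

# Learned workflow, stored with the schema fingerprint seen at learning time.
procedure = {
    "steps": ["create_ticket"],
    "fingerprints": {"create_ticket": schema_fingerprint(old_schema)},
}

print(is_stale(procedure, {"create_ticket": old_schema}))
print(is_stale(procedure, {"create_ticket": new_schema}))
```

A fingerprint only catches schema changes. A tool that keeps its schema but changes its behavior would still need a periodic replay against known inputs.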
Monitoring tells you that an agent failed, but observability tells you which step in the sequence caused it.
For a single LLM call, the distinction barely matters. For a multi-step agent with tool calls, branching logic, and intermediate states, the distinction is the difference between a two-hour fix and a two-week investigation.
The four things production agent observability actually requires:
• Input traces per step tell you what the agent received at each stage, not just the final prompt. Without them, you can only see the end state.
• Tool call logs capture which tool was called, with what parameters, and what it returned. This is the layer where silent failures hide.
• Intermediate decision points show where the agent chose one path over another and on what signal.
• Eval attachment links evaluations to specific execution traces so you can see what the eval found on the exact run that failed.
Build this before the first failure you cannot reproduce. By then, the trace is gone.
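The four requirements above can be sketched as one record per step, written out as JSON lines. The `StepTrace` fields and the `to_jsonl` helper are illustrative assumptions, not the schema of any specific observability tool.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, List, Optional


@dataclass
class StepTrace:
    run_id: str
    step: int
    step_input: Any                    # what the agent received at this step
    tool_call: Optional[dict] = None   # tool name, parameters, returned value
    decision: Optional[str] = None     # path chosen and the signal behind it
    eval_results: List[dict] = field(default_factory=list)  # evals for this run


def to_jsonl(traces: List[StepTrace]) -> str:
    """Serialize one JSON object per step so runs can be replayed later."""
    return "\n".join(json.dumps(asdict(t)) for t in traces)


traces = [
    StepTrace("run-1", 0, "Where is my order?", decision="route to order lookup"),
    StepTrace(
        "run-1",
        1,
        {"order_id": "A17"},
        tool_call={"tool": "orders.get", "params": {"order_id": "A17"}, "result": None},
    ),
]

lines = to_jsonl(traces).splitlines()
print(len(lines))
```

Keying every record by `run_id` is what lets an evaluation result be attached to the exact run that failed.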
LLM judges rate longer responses higher. That is not because length correlates with quality in your task domain. It is because length correlates with quality in the training data.
Human annotators rate more complete-looking answers higher. More words read as more effort, and models trained on that signal learn the proxy rather than the underlying quality criterion.
Here is what this does to your system over time: your production model learns that verbose answers score better, because they consistently do. The feedback loop runs quietly until someone checks whether longer is actually correct more often, and finds out it is not.
Write length-independence into your eval rubric. Tell the judge that brevity is acceptable when brevity is correct. Calibrate against examples where the short answer is correct, because this bias does not correct itself.
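A rough calibration check, assuming you have pairs of judge scores where the short answer is known to be the correct one. The function name and the sample scores are made up for illustration.

```python
from typing import List, Tuple


def length_preference_rate(pairs: List[Tuple[float, float]]) -> float:
    """Fraction of pairs where the judge scored the longer answer higher.

    Each pair is (short_answer_score, long_answer_score), and the short
    answer is known to be correct in every pair.
    """
    if not pairs:
        raise ValueError("need at least one calibration pair")
    long_wins = sum(1 for short, long in pairs if long > short)
    return long_wins / len(pairs)


# Made-up judge scores on a 1-5 scale, for illustration only.
calibration_pairs = [(4, 5), (5, 3), (3, 4), (4, 4)]

print(length_preference_rate(calibration_pairs))
```

On a set like this, where the short answer is always right, any rate well above zero suggests the judge is rewarding length rather than correctness.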
SWE-bench gives coding agents a known codebase, a clear problem statement, and a test suite that validates the fix. That is not what production looks like.
METR found in March 2026 that automated grader scores averaged 24 percentage points higher than what maintainers actually accepted. The benchmark was measuring something different from production readiness.
Four things production coding agent evals need that SWE-bench does not test:
• Multi-file reasoning: Production tasks require reasoning about files that the agent was not explicitly given.
• Tool failure handling: Real tools return malformed responses, and the eval should verify the agent handles them cleanly.
• Partial context tolerance: Real requirements are often ambiguous, which benchmark tasks never replicate.
• Regression detection: The eval should verify the agent has not touched code outside the task scope.
Benchmark scores tell you which models to eliminate. Production evals tell you which ones to ship.
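The regression-detection check can be sketched as a comparison between the files an agent changed and the paths the task was allowed to touch. The file names and glob patterns below are hypothetical, and this only covers scope. It does not replace running the existing test suite.

```python
from fnmatch import fnmatchcase
from typing import List


def out_of_scope(changed_files: List[str], allowed_patterns: List[str]) -> List[str]:
    """Return changed files that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


changed = [
    "src/billing/invoice.py",
    "src/auth/session.py",
    "tests/billing/test_invoice.py",
]
allowed = ["src/billing/*", "tests/billing/*"]

violations = out_of_scope(changed, allowed)
print(violations)
```

An eval can fail the run whenever `violations` is non-empty, even if the task's own tests pass.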
MCP standardizes how agents discover and connect to tools. It does not standardize what happens when those connections break.
Three things MCP does not handle:
• Retry safety: Whether a failed call is safe to retry depends on whether the operation has side effects, and MCP carries at most an optional, advisory hint about that, not a guarantee.
• Silent failures: When a tool returns null instead of an error, MCP does not surface that signal to the agent.
• Observability: Traces that reconstruct what happened across a multi-step sequence are not part of the MCP spec.
Engineers who mistake integration ease for production reliability will encounter this during the first real production failure. The protocol handles the connection, but everything that happens after is still your engineering problem.
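A minimal sketch of handling the first two gaps in your own code, assuming the caller knows whether each tool is idempotent. `call_tool`, `ToolError`, and the `idempotent` flag are hypothetical names. Nothing here is an MCP API.

```python
import time
from typing import Any, Callable, Optional


class ToolError(Exception):
    """Raised when a tool call fails, including when it silently returns None."""


def call_tool(
    fn: Callable[[], Any],
    *,
    idempotent: bool,
    max_attempts: int = 3,
    delay: float = 0.0,
) -> Any:
    # Calls with side effects get exactly one attempt.
    attempts = max_attempts if idempotent else 1
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            result = fn()
        except Exception as exc:
            last_error = exc
        else:
            if result is not None:
                return result
            # Treat a null result as a failure instead of passing it on.
            last_error = ToolError("tool returned null")
        if attempt < attempts - 1:
            time.sleep(delay)
    raise ToolError(f"tool failed after {attempts} attempt(s)") from last_error


calls = {"n": 0}


def flaky_lookup() -> Optional[str]:
    """Stand-in tool: returns None twice, then a value."""
    calls["n"] += 1
    return None if calls["n"] < 3 else "ok"


print(call_tool(flaky_lookup, idempotent=True))
```

With `idempotent=False`, the same call raises after a single attempt, which is the safer default for anything that writes, charges, or sends.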
Using the same model family to generate and judge your outputs isn’t evaluation. It’s self-grading.
Three biases that don’t show up in aggregate agreement scores but consistently show up in practice:
1. Position bias: Give the model two responses, and it favors whichever appears first, regardless of quality.
2. Verbosity bias: Longer outputs score higher, whether or not they are more accurate.
3. Self-enhancement bias: When a model judges its own outputs against a competitor, it rates itself higher even when the outputs are identical.
The research case for LLM-as-a-judge is solid. The case for running it without calibration is not.
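One common calibration step for position bias is to run each comparison in both orders and keep only verdicts that agree. The sketch below assumes a `judge(first, second)` callable that returns "first" or "second". That interface and the two toy judges are hypothetical stand-ins for a real model call.

```python
from typing import Callable


def judge_both_orders(judge: Callable[[str, str], str], a: str, b: str) -> str:
    """Return "a" or "b" if the verdict survives swapping, else "inconsistent"."""
    forward = judge(a, b)    # a is shown first
    backward = judge(b, a)   # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    if winner_forward == winner_backward:
        return winner_forward
    return "inconsistent"


def always_first(first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_shorter(first: str, second: str) -> str:
    """Toy judge whose verdict depends on content, not position."""
    return "first" if len(first) <= len(second) else "second"


print(judge_both_orders(always_first, "short", "a longer answer"))
print(judge_both_orders(prefers_shorter, "short", "a longer answer"))
```

The share of "inconsistent" verdicts across a sample gives a rough measure of how much position is driving the judge.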
Which of these has burned your eval pipeline?
Your agent passed every internal test. In production, it completed fewer than 1 in 3 tasks correctly. Not because the model was weak, but because production is a completely different test.
Here are the five failure modes that arrive at the same time:
• Context rot: Quality degrades across turns without any error being thrown.
• Tool execution unreliability: The model produces confident responses when tools return null or time out.
• Evaluation blindness: You find out quality changed when users complain, not when a metric catches it.
• Unsafe retry behavior: Retry logic re-runs stateful workflows and creates side effects.
• Memory drift: Agents behave inconsistently across sessions with the same user.
None of these arrives one at a time.