Adaline

@tryadaline

Iterate, evaluate, deploy, and monitor LLMs.

Playground

Joined January 2024

2 Following

824 Followers

551 Posts

Adaline @tryadaline

about 14 hours ago

The product playbook for what to build on top https://t.co/WxSR9jnhN5

Adaline @tryadaline

about 14 hours ago

MCP fixed integration, but it did not fix reliability. The protocol standardizes how the agent reaches the tool. It does not standardize how the tool behaves once reached. Reliability sits on top of the connector layer, not inside it.

Adaline @tryadaline

about 14 hours ago

Two reads worth pairing on this: The technical primer on MCP itself https://t.co/c3ApIqOicv

Adaline @tryadaline

1 day ago

Where tool calls actually break https://t.co/7OC71zTAbT

Adaline @tryadaline

1 day ago

A failed tool-calling agent rarely fails at the model. It fails in the span.

Four checks that find the cause faster than any model debugging:

1. Look at the step, not the output.
2. Check the tool contract.
3. Replay the state at the failed step.
4. Look at retries. https://t.co/PvtrnOtQbz
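The four checks above can be sketched in a few lines. This is only an illustration under assumptions: the `Step` schema, the `contracts` mapping of tool name to required argument names, and the `replay` helper are all hypothetical and do not come from any particular tracing library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One span of an agent run. Hypothetical schema, not a real tracing API."""
    name: str
    tool: str
    args: dict
    state_before: dict            # snapshot taken before the step ran
    error: Optional[str] = None
    attempts: int = 1


def first_suspect(steps, contracts):
    """Walk the steps in order and return (step, reason) for the first one that looks wrong."""
    for step in steps:                                   # check 1: the step, not the output
        if step.error is not None:
            return step, "step raised: " + step.error
        required = contracts.get(step.tool, set())       # check 2: the tool contract
        missing = required - set(step.args)
        if missing:
            return step, "missing required args: " + ", ".join(sorted(missing))
        if step.attempts > 1:                            # check 4: retries
            return step, f"needed {step.attempts} attempts"
    return None


def replay(step, tools):
    """Check 3: re-run one step in isolation from its recorded state and arguments."""
    return tools[step.tool](dict(step.state_before), **step.args)
```

The point of the design is that the search starts from recorded steps rather than from the final answer, so a plausible-looking output cannot hide a broken call in the middle of the run.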

Adaline @tryadaline

1 day ago

Two reads worth pairing on this: Agent observability, end-to-end https://t.co/M6yllmeXYF

Adaline @tryadaline

3 days ago

Full breakdown on all four memory scopes: https://t.co/d97UcinfLd

Adaline @tryadaline

3 days ago

Procedural memory is the most dangerous kind of stale memory in production agents because the agent has no idea it’s wrong. It stores how the agent does things: which tool to call, in what order, with what parameters.

The three memory types fail differently:

• Session memory: The agent re-asks for something the user already said. The user notices immediately.
• Entity memory: The agent maps a user to an outdated profile. Recommendations feel off. Trust erodes slowly.
• Procedural memory: The agent executes a learned workflow against a tool that changed its API. No error gets raised. The output looks plausible.

The first two failures surface eventually. Someone notices and escalates. Procedural failures don’t get escalated; they get attributed to “the agent being a bit off.” The staleness compounds until the workflow is consistently wrong and nobody knows why.

This is why procedural memory needs the shortest refresh cycle of any scope. Not because it changes most often, but because when it’s wrong, nothing in the system will tell you.
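One way to give procedural memory a short refresh cycle is to store a fingerprint of the tool schema alongside the learned workflow and re-check it before reuse. The sketch below assumes tool schemas are available as JSON-serializable dicts; `Procedure`, `schema_fingerprint`, `is_stale`, and the 15-minute default are illustrative choices, not part of any agent framework.

```python
import hashlib
import json
from dataclasses import dataclass


def schema_fingerprint(schema: dict) -> str:
    """Stable hash of a tool schema; key order does not affect the result."""
    return hashlib.sha256(json.dumps(schema, sort_keys=True).encode()).hexdigest()


@dataclass
class Procedure:
    """A learned workflow plus the tool schema it was learned against."""
    tool: str
    steps: list
    learned_at: float          # seconds since epoch
    tool_fingerprint: str


def is_stale(proc: Procedure, current_schema: dict, now: float,
             max_age_s: float = 900.0) -> bool:
    """Stale if too old or if the tool's schema changed since the workflow was learned."""
    if now - proc.learned_at > max_age_s:
        return True
    return schema_fingerprint(current_schema) != proc.tool_fingerprint
```

A stale result would trigger re-learning or re-validation of the workflow before the agent runs it, so the failure is raised explicitly instead of showing up as plausible but wrong output.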

Adaline @tryadaline

6 days ago

Observability vs. monitoring for agentic AI — what the distinction actually means in production: https://t.co/GQ9p6Ln4ia

Adaline @tryadaline

6 days ago

Monitoring tells you that an agent failed, but observability tells you which step in the sequence caused it.

For a single LLM call, the distinction barely matters. For a multi-step agent with tool calls, branching logic, and intermediate states, the distinction is the difference between a two-hour fix and a two-week investigation.

The four things production agent observability actually requires:

• Input traces per step tell you what the agent received at each stage, not just the final prompt. Without them, you can only see the end state.
• Tool call logs capture which tool was called, with what parameters, and what it returned. This is the layer where silent failures hide.
• Intermediate decision points show where the agent chose one path over another and on what signal.
• Eval attachment links evaluations to specific execution traces so you can see what the eval found on the exact run that failed.

Build this before the first failure you cannot reproduce. By then, the trace is gone.
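The four requirements above map naturally onto one record per step. The following is a minimal sketch, assuming an in-memory trace serialized to JSON; `StepTrace`, `RunTrace`, and their field names are hypothetical and not taken from any observability product.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Optional


@dataclass
class StepTrace:
    step: int
    input: str                          # what the agent received at this step
    tool: Optional[str] = None          # which tool was called, if any
    tool_args: Optional[dict] = None    # with what parameters
    tool_result: Any = None             # what it returned (None is kept, not dropped)
    decision: Optional[str] = None      # which path was chosen and on what signal
    evals: dict = field(default_factory=dict)


@dataclass
class RunTrace:
    run_id: str
    steps: list = field(default_factory=list)

    def record(self, **kwargs) -> StepTrace:
        """Append one step; the step index is assigned automatically."""
        st = StepTrace(step=len(self.steps), **kwargs)
        self.steps.append(st)
        return st

    def attach_eval(self, step: int, name: str, score: float) -> None:
        """Link an eval score to the exact step it was computed on."""
        self.steps[step].evals[name] = score

    def to_json(self) -> str:
        return json.dumps(asdict(self), default=str)
```

Keeping `tool_result` in the record even when it is `None` is deliberate: that is the field where a silent failure would otherwise disappear.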

Adaline @tryadaline

7 days ago

Full breakdown on LLM-as-a-judge bias and how to build evaluation pipelines that account for it: https://t.co/PrCdP3yhM0

Adaline @tryadaline

7 days ago

LLM judges rate longer responses higher, not because length correlates with quality in your task domain, but because length correlates with quality in the training data.

Human annotators rate more complete-looking answers higher. More words read as more effort, and models trained on that signal learn the proxy rather than the underlying quality criterion.

Here is what this does to your system over time: your production model learns that verbose answers score better, because they consistently do. The feedback loop runs quietly until someone checks whether longer is actually correct more often, and finds out it is not.

Write length-independence into your eval rubric. Tell the judge that brevity is acceptable when brevity is correct. Calibrate against examples where the short answer is correct, because this bias does not correct itself.
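A calibration check of this kind can be small. The sketch below assumes a hypothetical `judge(first, second)` callable that returns `"a"` for the first answer or `"b"` for the second, and a hand-built set of pairs where the short answer is known to be correct. Presentation order alternates so that a preference for the first slot is not mistaken for a preference for length.

```python
# Illustrative rubric line; the wording is an assumption, not a tested prompt.
RUBRIC_NOTE = "Do not reward length. A short answer is acceptable when it is correct."


def long_preference_rate(pairs, judge):
    """pairs: list of (short_correct, long_incorrect) answer strings.

    Returns the share of pairs where the judge picked the longer, incorrect answer.
    """
    if not pairs:
        raise ValueError("need at least one calibration pair")
    picked_long = 0
    for i, (short, long_) in enumerate(pairs):
        if i % 2 == 0:
            picked_long += judge(short, long_) == "b"   # long answer shown second
        else:
            picked_long += judge(long_, short) == "a"   # long answer shown first
    return picked_long / len(pairs)
```

A rate well above zero on this set suggests the judge is scoring length rather than correctness, and the rubric or the judge needs adjusting before its scores feed back into the production model.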

Adaline @tryadaline

8 days ago

How to evaluate coding agents in production — the four metrics that matter and the five failure modes to design tests around: https://t.co/ivBS7jjbDI

Adaline @tryadaline

8 days ago

SWE-bench gives coding agents a known codebase, a clear problem statement, and a test suite that validates the fix. That is not what production looks like.

METR found in March 2026 that automated grader scores averaged 24 percentage points higher than what maintainers actually accepted. The benchmark was measuring something different from production readiness.

Four things production coding agent evals need that SWE-bench does not test:

• Multi-file reasoning: Production tasks require reasoning about files that the agent was not explicitly given.
• Tool failure handling: Real tools return malformed responses, and the eval should verify the agent handles them cleanly.
• Partial context tolerance: Real requirements are often ambiguous, which benchmark tasks never replicate.
• Regression detection: The eval should verify the agent has not touched code outside the task scope.

Benchmark scores tell you which models to eliminate. Production evals tell you which ones to ship.
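The regression-detection item is the easiest of the four to automate. A minimal sketch, assuming the eval harness already knows which files the agent changed and that task scope is expressed as glob patterns; the function name and the pattern convention are illustrative.

```python
from fnmatch import fnmatch


def out_of_scope_changes(changed_files, allowed_patterns):
    """Return the changed files that match none of the allowed glob patterns."""
    return sorted(
        path for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    )
```

For example, with allowed patterns `src/billing/*` and `tests/billing/*`, a change to `src/auth/session.py` would be reported and the eval could fail the run even if the task's own tests pass. Note that `fnmatch` lets `*` match path separators, so `src/billing/*` also covers nested directories.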

Adaline @tryadaline

9 days ago

What production reliability actually requires beyond MCP: https://t.co/AMvHVMRIeQ

Adaline @tryadaline

9 days ago

MCP standardizes how agents discover and connect to tools. It does not standardize what happens when those connections break.

Three things MCP does not handle:

• Retry safety: Whether a failed call is safe to retry depends on whether the operation has side effects, and MCP does not carry that information.
• Silent failures: When a tool returns null instead of an error, MCP does not surface that signal to the agent.
• Observability: Traces that reconstruct what happened across a multi-step sequence are not part of the MCP spec.

Engineers who mistake integration ease for production reliability will encounter this during the first real production failure. The protocol handles the connection, but everything that happens after is still your engineering problem.
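The first two gaps can be covered by a thin wrapper around each tool call. This is a sketch under assumptions: the `idempotent` flag is metadata you maintain yourself for each tool, and `call_tool` and `ToolError` are hypothetical names that are not part of MCP or any MCP SDK.

```python
class ToolError(Exception):
    """Raised when a tool call fails or returns nothing usable."""


def call_tool(fn, args, *, idempotent, max_attempts=3):
    """Call a tool, retrying only when it is declared safe to retry.

    A None result is treated as a failure rather than passed on to the agent.
    """
    attempts = max_attempts if idempotent else 1   # never re-run side effects
    last_exc = None
    for _ in range(attempts):
        try:
            result = fn(**args)
        except Exception as exc:                   # sketch: real code would narrow this
            last_exc = exc
            continue
        if result is None:
            raise ToolError("tool returned None; surfacing as an explicit failure")
        return result
    raise ToolError(f"tool failed after {attempts} attempt(s)") from last_exc
```

The wrapper makes the retry decision depend on a declared property of the operation, not on the shape of the error, and turns a null response into something the agent and the trace can see.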

Adaline @tryadaline

10 days ago

Read the full blog here: https://t.co/PrCdP3yhM0

Adaline @tryadaline

10 days ago

Using the same model family to generate and judge your outputs isn’t evaluation. It’s self-grading.

Three biases that don’t show up in aggregate agreement scores but consistently show up in practice:

1. Position bias: Give the model two responses, and it favors whichever appears first, regardless of quality.
2. Verbosity bias: Longer outputs score higher, not more accurate ones.
3. Self-enhancement bias: When a model judges its own outputs against a competitor, it rates itself higher even when the outputs are identical.

The research case for LLM-as-a-judge is solid. The case for running it without calibration is not.

Which of these has burned your eval pipeline?
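Position bias is the simplest of the three to control for: ask twice with the order swapped and keep only verdicts that survive the swap. The sketch assumes a hypothetical `judge(first, second)` callable that returns `0` if it prefers the first response and `1` if it prefers the second.

```python
def debiased_preference(judge, a, b):
    """Return "A" or "B" if the judge agrees with itself in both orders, else None."""
    forward = "A" if judge(a, b) == 0 else "B"    # a shown first
    backward = "B" if judge(b, a) == 0 else "A"   # b shown first
    return forward if forward == backward else None
```

A `None` result marks a pair where the verdict depended on order. The share of such pairs is itself a useful calibration number, and it costs one extra judge call per comparison.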

Adaline @tryadaline

13 days ago

https://t.co/lIia6aWSCM

Adaline @tryadaline

13 days ago

Your agent passed every internal test. In production, it completed fewer than 1 in 3 tasks correctly. Not because the model was weak, but because production is a completely different test.

Here are the five failure modes that arrive at the same time:

• Context rot: Quality degrades across turns without any error being thrown.
• Tool execution unreliability: The model produces confident responses when tools return null or time out.
• Evaluation blindness: You find out quality changed when users complain, not when a metric catches it.
• Unsafe retry behavior: Retry logic re-runs stateful workflows and creates side effects.
• Memory drift: Agents behave inconsistently across sessions with the same user.

None of these arrives one at a time. Full reading guide in the first reply.
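Evaluation blindness in particular has a cheap first countermeasure: score a sample of production runs and alert when a rolling pass rate drops. A minimal sketch, assuming each sampled run already yields a pass/fail result; the class name, window size, and threshold are illustrative.

```python
from collections import deque


class RollingPassRate:
    """Tracks the pass rate over the most recent `window` evaluated runs."""

    def __init__(self, window=50, threshold=0.8):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def add(self, passed):
        """Record one result; return True when the full window is below threshold."""
        self.results.append(bool(passed))
        window_full = len(self.results) == self.results.maxlen
        return window_full and sum(self.results) / len(self.results) < self.threshold
```

Waiting for a full window before alerting avoids noise from the first few runs, at the cost of a short blind period after each deploy.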
