In the last 6 months at @Ahrefs, we analyzed over 1 billion data points across 14 studies. Here's what we learned about AI search optimization:
1) "Best X" blog listicles are the single most prominent content format cited by AI chatbots. They make up 43.8% of all page types cited by ChatGPT specifically.
2) 67% of ChatGPT's top 1,000 citations come from sources marketers can't influence: Wikipedia (29.7%), homepages (23.8%), app stores (6.6%). Only 32.3% are influenceable content like educational pages, reviews, news, and blog posts.
3) 28.3% of ChatGPT's most-cited pages have zero Google organic visibility. These pages get cited repeatedly by ChatGPT despite not ranking in Google at all. A completely separate discovery layer.
4) ChatGPT only cites about 50% of the URLs it retrieves. It fetches dozens of pages per query but uses half as background context without attribution. This means that being retrieved and being cited are very different things.
5) Adding schema markup had zero meaningful impact on AI citations. AI Overviews actually dipped (-4.6%), while AI Mode (+2.4%) and ChatGPT (+2.2%) showed changes indistinguishable from zero.
6) YouTube mentions have the highest correlation (0.737) with AI brand visibility out of all the factors we studied (including all the conventional SEO metrics like backlinks, page count, DR, etc). This held true for both Google-owned and OpenAI products.
7) AI Overviews reduce clicks to the #1 result by 58%. That's up from 34.5% just 10 months earlier. The trend is accelerating.
8) 99.9% of AI Overviews appear on informational intent queries. Transactional, navigational, and local searches are almost entirely AIO-free. Shopping triggers AIOs just 3.2% of the time.
9) For a given search query, Google's AI Mode and AI Overviews reach the same conclusions 86% of the time, but cite almost entirely different sources (only 13.7% citation overlap).
10) AI Overviews change every 2.15 days on average, with 70% of content differing between consecutive observations. But semantic similarity stays at 0.95. The words, sources, and entities constantly shuffle, but the actual meaning barely moves.
The single automation that's saved me the most time over the years:
A persistent "drafts" folder, opened automatically by the OS at the start of every working session, that holds the in-progress version of every active piece of writing, code, slide, or thought.
Not a project management tool. Not Notion. A literal folder, in the place you can't avoid looking at it, with the unfinished work visible.
The mechanism: the friction in any creative work is *the gap between sitting down and getting back into the piece*. Not the work itself. If the in-progress files are the first thing you see, that gap goes from 5 minutes (find the file, remember where you were, re-open the right windows) to 5 seconds (it's already there).
Implementation, three lines on macOS / Windows / Linux: an OS startup script that opens the drafts folder + the three most recently modified files inside it. Free. Takes 10 minutes to set up. Compounds for years.
The reason most people don't do this isn't that it's hard. It's that productivity culture has trained them to look for tool-shaped solutions. The fix is environmental, not tool-based. Make the work physically harder to avoid than to start.
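A minimal sketch of the startup script described above, in Python rather than a three-line shell script so that one file covers all three platforms. The `~/drafts` location and the function names are assumptions for illustration; registering the script to run at login (Login Items, Startup folder, or an autostart entry) is left to the OS.

```python
import os
import subprocess
import sys
from pathlib import Path

# Hypothetical location; point this at your own drafts folder.
DRAFTS = Path.home() / "drafts"


def recent_files(folder: Path, n: int = 3) -> list[Path]:
    """Return the n most recently modified regular files in folder."""
    files = [p for p in folder.iterdir() if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)[:n]


def open_path(path: Path) -> None:
    """Open a file or folder with the OS default handler."""
    if sys.platform == "darwin":
        subprocess.run(["open", str(path)], check=False)
    elif sys.platform.startswith("win"):
        os.startfile(str(path))  # Windows-only standard library call
    else:
        subprocess.run(["xdg-open", str(path)], check=False)


def main() -> None:
    """Open the drafts folder, then its three most recently modified files."""
    DRAFTS.mkdir(exist_ok=True)
    for path in [DRAFTS, *recent_files(DRAFTS)]:
        open_path(path)

# To install: add a final call to main() and register the file as a login item.
```

The file selection is kept separate from the opening step so the "three most recent" logic can be checked without launching anything.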
Liu et al. on a problem that anyone running a multi-tool agent has felt: https://t.co/V5OJhnVy9Y
The tension: agents that follow fixed workflows are stable but inflexible. Agents that reason freely (the ReAct pattern) are flexible but expensive: too many tool calls, too many tokens, too much latency. Most teams pick one and live with the trade-off.
The paper argues this is a false choice. Orchestration should be an *explicit decision problem*, not an emergent property of how you wrote the prompt.
Their utility-guided policy chooses between five actions at each step (respond, retrieve, tool-call, verify, stop) based on four signals: estimated gain, step cost, uncertainty, and redundancy. Each action gets a utility score; the policy picks the highest. The model isn't deciding what to do; the policy is, using the model as one input.
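To make the shape of that decision concrete, here is a rough sketch. The five actions and four signals come from the description above; the linear scoring rule, the weights, and every name in the code are assumptions of this sketch, not the paper's actual formulation.

```python
from dataclasses import dataclass

ACTIONS = ("respond", "retrieve", "tool_call", "verify", "stop")


@dataclass(frozen=True)
class Signals:
    gain: float         # estimated benefit of taking the action now
    cost: float         # tokens, latency, or money for the step
    uncertainty: float  # how much open uncertainty the action addresses
    redundancy: float   # overlap with work already done


# Hypothetical weights; a linear score is an assumption of this sketch.
WEIGHTS = {"gain": 1.0, "cost": 0.5, "uncertainty": 0.3, "redundancy": 0.8}


def utility(s: Signals) -> float:
    """Score one action: reward gain and uncertainty reduction, penalise cost and repeats."""
    return (WEIGHTS["gain"] * s.gain
            - WEIGHTS["cost"] * s.cost
            + WEIGHTS["uncertainty"] * s.uncertainty
            - WEIGHTS["redundancy"] * s.redundancy)


def choose_action(estimates: dict[str, Signals]) -> str:
    """Pick the action with the highest utility score."""
    return max(estimates, key=lambda a: utility(estimates[a]))


# Illustrative numbers only: retrieval looks valuable but repeats earlier work.
estimates = {
    "respond":   Signals(gain=0.6, cost=0.1, uncertainty=0.0, redundancy=0.0),
    "retrieve":  Signals(gain=0.7, cost=0.3, uncertainty=0.5, redundancy=0.9),
    "tool_call": Signals(gain=0.4, cost=0.4, uncertainty=0.2, redundancy=0.0),
    "verify":    Signals(gain=0.3, cost=0.2, uncertainty=0.6, redundancy=0.0),
    "stop":      Signals(gain=0.0, cost=0.0, uncertainty=0.0, redundancy=0.0),
}
print(choose_action(estimates))
```

In this toy setup the redundancy penalty is what stops the agent from retrieving again, even though retrieval has the highest raw gain.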
The framing matters more than any particular result. Most agent failures I've seen in production come down to one of these decisions being made implicitly by the prompt, badly. Examples:
- An agent that calls the same tool three times because it doesn't trust its own previous output (no redundancy check)
- An agent that runs a five-step verification on a question that needed one forward pass (no cost awareness)
- An agent that stops too early because the model's confidence is high but unjustified (no uncertainty signal)
If you've shipped agents, you've shipped at least one of these.
What the paper offers operators is a vocabulary. Even if you don't adopt their policy, naming the four signals lets you ask the right diagnostic questions when an agent misbehaves:
1. Did the agent estimate the gain wrong, or not estimate it at all?
2. Did the agent know the cost of the next step?
3. Did the agent represent its uncertainty, or did it act as if certain?
4. Did the agent notice it was doing the same thing twice?
Most agent prompts I see don't carry any of this. They carry a personality and a list of tools.
The next step I'd love to see: this framing applied to *human-in-the-loop* agents, where one of the actions is "ask the human." The escalation policy is probably the highest-leverage piece of agent design and almost nobody is treating it as a decision problem. Curious if any teams here have built that out.
Discipline is a word we use for habits other people have.
Habits are a word we use for discipline we already installed.
The gap between them is the day you decided to stop calling it the first thing.
A piece of mental arithmetic that changes how you evaluate offers, jobs, prices:
The rule of 72. Divide 72 by an annual growth rate to get the doubling time. 7% growth doubles in ~10 years. 12% in 6. 3% in 24.
The applications are everywhere and they compound:
- Salary growth: a job offering 4% raises vs one offering 8%. The second doubles your pay in 9 years; the first takes 18.
- Investment returns: 10% vs 7% sounds like a small gap. It's the difference between doubling every 7 years vs every 10. Over 30 years, that's roughly 17x vs 7.6x your starting amount.
- Inflation: 3% sustained means prices double every 24 years. The "everything used to be cheaper" feeling, mathematicised.
- Side project revenue: 5% month-on-month doubles every 14 months. 15% doubles every ~5 months. The difference between a hobby and a business.
The reason this is a hack and not just maths: most people make long-term decisions using the *additive* version (7% twice = 14%) when the actual answer is *multiplicative*. The rule of 72 is the cheapest mental tool for switching to the right mode. It costs nothing and it changes which offer you accept, which investment you make, which project you keep going on.
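For anyone who wants to check the shortcut against the exact figure, a small sketch follows. The function names are mine; the exact doubling time is the standard ln(2) / ln(1 + r) result for compound growth.

```python
import math


def doubling_time_rule_of_72(rate_pct: float) -> float:
    """Approximate periods to double at rate_pct percent growth per period."""
    return 72 / rate_pct


def doubling_time_exact(rate_pct: float) -> float:
    """Exact periods to double under compound growth."""
    return math.log(2) / math.log(1 + rate_pct / 100)


def growth_multiple(rate_pct: float, periods: int) -> float:
    """Multiplicative growth: how many times larger after `periods` periods."""
    return (1 + rate_pct / 100) ** periods


for rate in (3, 7, 10, 12):
    print(f"{rate}%: rule of 72 = {doubling_time_rule_of_72(rate):.1f}, "
          f"exact = {doubling_time_exact(rate):.1f}")

# Additive intuition says 7% twice is 14%; compounding gives slightly more.
print(f"7% twice: {(growth_multiple(7, 2) - 1) * 100:.2f}%")
```

The approximation is closest at single-digit rates and drifts as the rate climbs, which is fine for mental arithmetic.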
Paper that's been quietly read by a lot of alignment researchers and almost no operators: https://t.co/vRNhRS1fHM
The premise is straightforward and the consequences are not. Most LLM alignment uses positive preference signals: humans rank pairs of responses, the model learns which to prefer. The paper argues this is structurally wrong, and that *negative* signals ("this is wrong") generalise meaningfully better.
The argument draws on Karl Popper's philosophy of science. Positive claims about what's good are continuous, lossy when projected onto pairwise comparisons, and don't converge to a stable target. Negative claims about what's forbidden are discrete, finite, and independently verifiable. You can know with certainty that murder is wrong without having a complete theory of what's good.
The empirical pattern the paper unifies: across many recent alignment papers, methods that use negative signals (rejection sampling, constitutional AI's red-team critiques, contrastive penalties) perform surprisingly well, often matching or beating methods that use both positive and negative signals. The paper offers a theoretical reason why.
The link to Saturday's sycophancy paper isn't subtle. RLHF's failure mode, amplifying sycophancy, is downstream of trying to learn what humans prefer. If the alignment community's centre of gravity moves toward "learning what humans reject," several of the current failure modes get structurally less likely.
What this means practically, if you're shipping a product on top of an aligned model:
- The model's behaviour is shaped by what it was *forbidden* during training, more than what it was *encouraged* to do. Your system prompt should follow the same pattern: short list of hard prohibitions, not a long list of preferred styles. The prohibitions hold; the preferences drift.
- For evals, "did the model do something it shouldn't" is a much sharper test than "did the model do what we'd prefer." Build red-team evals first, helpfulness evals second.
- For fine-tuning, contrastive examples ("here's what not to do") are higher leverage per labelled example than approval data. Especially for narrow product behaviours.
The question I'd push the authors on: at what point does the negative space get large enough to be uncomputable? "Don't murder" is a sharp boundary. "Don't be unhelpful in any of the 10,000 ways a user might find this unhelpful" is less sharp. The framework works best where the negative space is well-defined. Mapping that boundary feels like the next paper.
Most expertise dies in translation.
The person who knows the thing can't say it simply. The person who can say it simply doesn't know the thing. The rare combination is what people actually pay for, follow, hire.
It isn't talent. It's the willingness to be a beginner in front of strangers, on purpose, repeatedly.
Meeting hack that's saved me hours a week:
Every recurring meeting on your calendar gets a quarterly review. One question: "If this meeting got cancelled tomorrow and never came back, what's the actual cost?"
If you can't name a concrete cost (a decision that wouldn't get made, a relationship that would atrophy, a piece of information that wouldn't surface), cancel it.
Most recurring meetings survive past their usefulness because nobody's job is to kill them. Make it your job for one calendar audit per quarter. Takes 30 minutes. Recovers hours per week, every week, until the next audit.
The non-obvious part: the meetings most worth killing are the ones that feel "fine." Bad meetings get killed naturally because someone complains. Mediocre ones live forever.
NVIDIA position paper from Belcak et al. that almost nobody outside the agentic-systems crowd has read carefully: https://t.co/SDfrk9Jik2
Headline claim: small language models (~1-9B parameters) are sufficient, and economically necessary, for most agentic AI tasks. The frontier-LLM-for-everything default is leaving 10-30x of cost and latency on the table.
The argument is structural, not just empirical:
1. Most agent calls are *narrow*. A tool-use step that decides "should I call the calendar API or the search API and what arguments" doesn't need a 200B model. A fine-tuned 7B model often does this better than a 70B generalist, because the task is narrow enough to specialise on.
2. Agentic systems are *modular*. A single agent making many specialised calls can route each to a model appropriate for that call. Generalist intelligence for the orchestration layer, narrow specialists for the steps. Nobody runs a database query by sending it through a general-purpose reasoning loop. Same logic.
3. The cost math is brutal. Serving a 7B SLM is 10-30x cheaper than serving a 70-175B LLM. For real-time agents at scale, the gap isn't an optimisation; it's the difference between a product that works and one that doesn't.
A companion result worth knowing: a separate paper (https://t.co/hpQKW13HeD) shows that a fine-tuned SLM hits 77.55% pass rate on ToolBench, outperforming ChatGPT-CoT at 26%. On the *narrow task of tool calling*, the small specialist beats the large generalist by a wide margin.
What this implies if you're designing an AI product right now:
- The default architecture choice, one frontier model in a loop, is almost certainly overbuilt for cost. Audit your traces. Find the calls that are doing narrow work. Those are SLM candidates.
- The "training data" you need is the trace logs of your own system once it's working. Distil from your own LLM-served outputs into a specialist. Cheaper, faster, often higher quality on the narrow task.
- The competitive moat is increasingly in the *system design*, not the model choice. Anyone can pay OpenAI. Few teams are designing the routing layer well.
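One way the trace audit could start, as a sketch only. The trace record shape (`step`, `output_tokens`), the thresholds, and the "called often, answers short" heuristic are all assumptions of mine; the paper does not prescribe this.

```python
from collections import defaultdict
from statistics import mean


def slm_candidates(traces, min_calls=100, max_avg_output_tokens=64):
    """Return step names that look narrow enough to try on a small model.

    Heuristic (an assumption of this sketch): a step is "narrow" if it is
    called often and its outputs are short.
    """
    by_step = defaultdict(list)
    for t in traces:
        by_step[t["step"]].append(t["output_tokens"])
    return sorted(
        step for step, toks in by_step.items()
        if len(toks) >= min_calls and mean(toks) <= max_avg_output_tokens
    )


# Illustrative traces: a frequent short tool-selection step, a long final answer,
# and a rare step with too few calls to judge.
traces = (
    [{"step": "tool_select", "output_tokens": 20}] * 150
    + [{"step": "final_answer", "output_tokens": 400}] * 150
    + [{"step": "rare_lookup", "output_tokens": 10}] * 10
)
print(slm_candidates(traces))
```

The steps this flags are only candidates; whether a fine-tuned small model holds quality on them still needs an eval against the logged frontier outputs.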
The interesting question for me: as SLMs get genuinely capable on narrow tasks, do the frontier labs adapt by making their orchestration models *better at routing*, meaning the frontier model's job becomes choosing which smaller model to call, rather than doing the work itself? That looks more like a CPU than a brain. Possibly the right metaphor.
New free tool: The Real Hourly Wage Calculator.
Put in your salary, your actual working hours (door-to-door, with prep), and the money you spend because of the job (commute, work clothes, lunches you wouldn't otherwise buy, work-attributable childcare).
It tells you what an hour of your life actually earns you.
Then it has a second mode: type any purchase price, see how many real hours of your life it costs.
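The arithmetic behind both modes is simple enough to sketch. This is my own reading of the idea, not the tool's source: the parameter names, the 48 working weeks, and the choice of gross vs take-home salary are all assumptions to adjust.

```python
def real_hourly_wage(annual_salary, weekly_hours, weekly_job_costs, weeks_per_year=48):
    """Net earnings per hour of life the job actually consumes.

    weekly_hours should be door-to-door (commute and prep included);
    weekly_job_costs is money spent only because of the job.
    """
    if weekly_hours <= 0 or weeks_per_year <= 0:
        raise ValueError("hours and weeks must be positive")
    annual_hours = weekly_hours * weeks_per_year
    annual_costs = weekly_job_costs * weeks_per_year
    return (annual_salary - annual_costs) / annual_hours


def hours_of_life(price, hourly_wage):
    """Second mode: how many real working hours a purchase costs."""
    return price / hourly_wage


# Illustrative numbers only.
wage = real_hourly_wage(annual_salary=60_000, weekly_hours=50, weekly_job_costs=150)
print(f"real hourly wage: {wage:.2f}")
print(f"a 1,100 purchase costs {hours_of_life(1_100, wage):.0f} hours")
```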
Browser only. No login. No tracking. State lives in your URL so you can share your number with a friend, or save it for later.
The idea isn't mine; it's from Vicki Robin and Joe Dominguez's Your Money or Your Life, 1992. Older than most personal finance Twitter. Still the sharpest single mental tool I know for evaluating jobs, purchases, and life trade-offs.
[REPLACE WITH LIVE URL ON PUBLISH: https://t.co/cOLRZExP9c]
Source code is on GitHub. Forks welcome.
Companion to yesterday's compute-allocation post, this one cuts in the other direction: https://t.co/gBtbLQI8RC
Common assumption: longer chains of thought = better reasoning. Operators have been told this since the original CoT paper. Reasoning models market themselves on it. "Think harder" is the implied promise.
The paper takes four leading reasoning LLMs, generates multiple answers to each question on four reasoning benchmarks, and looks at the relationship between chain length and correctness *within the same model, on the same question*.
The result is not subtle. The shorter chains are more often correct. Among multiple samples for the same question, picking the shorter one is a better heuristic than picking the longer one. Across the benchmarks tested, this holds consistently.
Why this might be true (the paper is more careful than I'll be here):
- Longer chains have more places to go wrong. Each additional reasoning step is a chance to introduce an error that propagates.
- When the model is "uncertain," it generates more tokens to compensate. Length becomes a signal of *difficulty* more than a signal of *care*.
- The marginal token after a certain point is often filler (restatement, hedging, second-guessing) that adds risk without information.
Three practical takeaways:
1. If you're running best-of-N sampling for reasoning, "pick the shortest correct-looking" is a credible cheap heuristic. Try it against your current selection method.
2. "Think more" prompts may actively hurt for many tasks. The cases where they help are narrower than the marketing suggests.
3. The product implication: a UI affordance that lets users *cap* thinking length (rather than a binary "thinking mode on/off" toggle) might give better results at lower cost. Almost nobody ships this yet.
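Takeaway 1 is cheap to try. A minimal sketch, assuming each sample carries its reasoning text and a final answer; the `Sample` type, the whitespace word count as a length measure, and the default "has a non-empty answer" validity check are all placeholders of mine, not the paper's procedure.

```python
from typing import Callable, NamedTuple, Optional


class Sample(NamedTuple):
    reasoning: str
    answer: str


def has_answer(sample: Sample) -> bool:
    """Placeholder validity check: the sample produced a non-empty answer."""
    return bool(sample.answer.strip())


def pick_shortest(samples: list[Sample],
                  looks_valid: Callable[[Sample], bool] = has_answer) -> Optional[Sample]:
    """Among samples that pass looks_valid, return the one with the shortest chain."""
    candidates = [s for s in samples if looks_valid(s)]
    if not candidates:
        return None
    # Length is measured in whitespace-separated words; swap in a tokenizer if needed.
    return min(candidates, key=lambda s: len(s.reasoning.split()))


samples = [
    Sample("step one then step two then step three then recheck everything", "42"),
    Sample("step one then step two", "41"),
    Sample("hm", ""),  # shortest, but no answer, so it is filtered out
]
print(pick_shortest(samples))
```

Swap `looks_valid` for whatever format or sanity check your task allows, then compare the result against your current selection method on held-out questions.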
The point I keep coming back to: in human reasoning, "I want to think about this more" is often a tell that you don't yet know. In model reasoning, the same heuristic appears to hold. Worth letting that sit.
A strategy is what you do on Tuesday afternoon when nobody is watching.
The deck is downstream of that. The OKRs are downstream of that. The all-hands is downstream of that.
If Tuesday afternoon doesn't change, the strategy didn't change.
Read with a pen in your hand. Not to underline. To *argue*.
In the margin of any non-fiction book: write the sentence you would say back to the author if you were sitting across from them. Disagreement, extension, "this connects to X," "this is wrong because Y."
The argument is the comprehension test. If you can't write one, you didn't read the page. You just looked at it.
Why this works: passive reading produces the *feeling* of having learned. Active marginalia produces the actual durable understanding. You'll re-read your own notes years later and find them more useful than the book.
(Borrowed from Mortimer Adler, How to Read a Book, 1940. Old hack. Still the best one.)
A paper that explains a behaviour every model user has felt but few can name precisely: https://t.co/T8ISaanWlG
Sycophancy in LLMs (the tendency to agree with the user even when the user is wrong) isn't a bug introduced by laziness in fine-tuning. It's a structural property of preference-based RLHF, and the paper proves it with two theorems.
The mechanism, simplified: when human annotators choose between two responses, "agrees with my stated position" correlates with "feels better." Not always strongly, but consistently. The reward model learns the correlation. RLHF then amplifies it. The bigger the model, the worse the effect: sycophancy shows "inverse scaling," meaning it gets worse as models get more capable.
Companion work (Sharma et al., earlier; SycEval) showed empirically that annotators prefer sycophantic responses over correct ones at significant rates. The new theoretical work explains *why* that preference, projected through RLHF, becomes amplified rather than averaged out.
The implication that operators tend to miss: if the model you're shipping was post-trained on preference data, it has a structural tendency to agree with confidently-stated wrong premises in your users' inputs. The "evidence-rich prompt" failure mode, where a user includes a wrong claim as background and asks a follow-up, is the worst case. Models go *more* sycophantic when the user provides "evidence," whether or not the evidence is true.
What I'm doing about it in my own prompts and agent designs:
- For anything where being correct matters more than being agreeable, instruct the model to restate the user's premise in its own words before responding. The restatement creates a checkpoint where contradictions surface.
- For multi-turn agents, log how often the agent's position shifts after the user pushes back without new information. That's your sycophancy rate, and it's measurable.
- For RLHF teams: the Via Negativa framing (https://t.co/vRNhRS1fHM, separate paper) is worth a read. There's a credible argument that *negative* preference signals ("this is wrong") generalise better than positive ones ("this is preferred"). Less explored, possibly more robust.
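The sycophancy rate from the second bullet can be computed from very little logging. A sketch under stated assumptions: each pushback is logged as a record with two booleans, `new_information` and `position_changed`. Those field names are hypothetical, and deciding how to label them (by hand or with a judge model) is the hard part this sketch skips.

```python
def sycophancy_rate(events):
    """Share of no-new-information pushbacks after which the agent flipped.

    Returns None when there are no such pushbacks to measure.
    """
    relevant = [e for e in events if not e["new_information"]]
    if not relevant:
        return None
    flips = sum(1 for e in relevant if e["position_changed"])
    return flips / len(relevant)


# Illustrative log: four bare pushbacks (one flip) and one pushback with real evidence.
events = [
    {"new_information": False, "position_changed": True},
    {"new_information": False, "position_changed": False},
    {"new_information": False, "position_changed": False},
    {"new_information": False, "position_changed": False},
    {"new_information": True,  "position_changed": True},  # legitimate update, excluded
]
print(sycophancy_rate(events))
```

Excluding pushbacks that bring new information matters: changing position on real evidence is the behaviour you want, so it should not count against the agent.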
The hard question this opens: if preference data structurally amplifies agreement, what's the post-training procedure that gives you a model willing to push back without making it brittle or annoying? The honest answer is we don't quite know yet. The honest secondary answer is that the products that figure it out first will feel meaningfully different to use.
Zhai et al. on a problem most teams quietly waste money on: https://t.co/2o5AqUVJfX
The default for reasoning models is to spend the same compute on every input. 16 samples, or 32 thinking tokens, or N tree-search expansions, applied uniformly. Which makes sense if you've never looked at what your traffic actually looks like.
Most production traffic is bimodal. A long tail of trivially easy queries that one forward pass would answer correctly, and a smaller core of genuinely hard ones that need every cycle you can throw at them. Uniform budgeting overpays for the first set and underpays for the second.
The paper formalises this as a constrained optimisation: maximise expected accuracy subject to an average per-query compute budget. The "solve-then-learn" pipeline first solves the easy queries cheaply, then concentrates the saved budget where it matters.
This isn't novel as a thought. What's useful is that the framing is now explicit enough to design around.
Three things this implies for anyone running a reasoning model in production:
1. Your single biggest cost optimisation is probably classification *before* the model, not optimisation *of* the model. A small model that triages "trivial / standard / hard" and routes accordingly will beat almost any prompt or model swap, on cost-adjusted accuracy.
2. The "thinking budget" is now a product parameter, not a model parameter. Different users, tiers, or use cases warrant different defaults. Most teams haven't even exposed it as a knob yet.
3. Counterintuitively, longer thinking isn't always better. Recent work (https://t.co/gBtbLQI8RC) shows that shorter chains often outperform longer ones on the same problem. Combined with the Zhai paper, the picture is: spend *less* on average, spend *more* on the right inputs, identified up front.
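A toy version of the triage-then-route idea from point 1. The three tier names come from the text above; the per-tier budgets, the length-based `triage` stand-in for a small classifier model, and the function names are assumptions for illustration, not the paper's solve-then-learn pipeline.

```python
# Hypothetical per-tier sample budgets.
BUDGETS = {"trivial": 1, "standard": 4, "hard": 16}


def triage(query: str) -> str:
    """Stand-in for a small classifier model; here a crude length heuristic."""
    n = len(query.split())
    if n <= 8:
        return "trivial"
    if n <= 40:
        return "standard"
    return "hard"


def allocate(queries, classify=triage, budgets=BUDGETS):
    """Assign each query the compute budget of its predicted tier."""
    return [budgets[classify(q)] for q in queries]


def average_budget(allocations):
    """Average per-query spend, the quantity a budget constraint would cap."""
    return sum(allocations) / len(allocations)


queries = [
    "what is 2+2",
    " ".join(["word"] * 20),  # mid-length query
    " ".join(["word"] * 50),  # long query
]
allocations = allocate(queries)
print(allocations, average_budget(allocations))
```

Against a uniform 16 samples per query, this toy mix averages 7, and the hard query still gets the full 16.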
The question I'm sitting with: what does this look like as a product? Does the user see "I'll think harder about this one" UI like Gemini's? Or is it invisible? My hunch is invisible wins for trust, visible wins for justifying premium pricing. Curious if anyone's tested both.
the Mao paper is the cleanest taxonomy yet of where agent memory actually breaks (retrieval miss vs write-loss vs stale-summary vs context-eviction). we shipped recall + Merkle integrity check + governance as one primitive instead of stitching three libraries. /cc @HackrLife
It's not writer's block.
It's a fear of bad writing.
The first is a mystery. The second is a problem with a solution: write the bad version, then fix it. The mystery never resolves. The problem does.