I break AI models for a living: edge cases, hallucinations, prompt failures | AI Evaluator · Prompt Engineer · I build no/low-code workflow automation that works
My job is to argue with AI until it confesses it was wrong. Most days I win; some days I walk away and come back to it later.
Here’s what 3 months of breaking LLMs taught me about how they actually fail
The other day I was arguing with someone about how books like 'Rich Dad, Poor Dad' make little sense for Nigerians at large.
One of the points I raised was that financial literacy cannot make up for systemic failures.
This is a prime example of the statement: “you cannot out-hustle a bad system.”
Imagine a loan was taken for this.
There are a million reasons why your app can be slow.
The stack of software and hardware beneath it is complex. Pinning down performance issues is challenging (but also fun!)
A clip from the chapter 1 stream. More to come this week.
The Claude Code team has been shipping with Claude Tag internally all year.
It now writes 65% of our product team's code, including most of what built Claude Tag itself.
Here are a few ways we use it every day: 🧵
https://t.co/7PLrW06TvH
Introducing GLM-5.2: Frontier Intelligence, Open Weights
- Significant improvements in coding and agentic tasks
- Strong long-horizon capabilities with a 1M context window
- Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong balance between performance and token efficiency
- MIT-licensed open weights
- Same API pricing as GLM-5.1
Tech Blog: https://t.co/LAsxUdN0JZ
Weights: https://t.co/g0A1C4UWx4
API: https://t.co/Kc3E22cbN7
Coding Plan: https://t.co/Nk8Y98HNhU
Chat: https://t.co/WCqWT0qCQb
This model is insane at design.
I asked GLM-5.2 (left) and Opus 4.8 (right) to build me a landing page, and you can't even tell the difference.
GLM cost $0.06 while Opus cost $0.49. That's roughly 8x cheaper, while also being faster and more token-efficient.
Another win for open source AI.
As a result of a US government directive, we are suspending access to Claude Fable 5 for all users. You can continue to use all other Claude models.
Here’s what this means for you:
Across Claude products, new sessions will run on your selected default model or Opus 4.8, and existing Fable 5 sessions will end with an error.
On the Claude Platform, requests to Fable 5 will also return an error. Please update your integrations to other Claude models.
We know this is a disruption to your workflows; we appreciate your patience and support.
99% of people are using Claude Fable 5 wrong.
People don't know how to work with it yet because nothing this powerful has ever existed.
In under 34 minutes, I'll show you 10+ use cases and startup ideas that can only exist because Fable 5 is here.
The next evolution of Hermes Agent is here!
Introducing Hermes Desktop: everything you love about Hermes, now native on your machine.
First demoed in Jensen's GTC keynote, it's now in public preview.
Introducing Claude Opus 4.8: it builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors.
Available today at the same price.