Claude Design builds your entire brand design system in 10 minutes by scraping your website.
Then every prototype, deck, and social post you make automatically matches your brand.
Prototype → Claude Code → ship. This is wild.
Watch the full video: https://t.co/qXHTqqaSDz
I built my own AI operator.
It can use any model, connect to your tools in one click, keep persistent memory, and actually get work done.
Because other AI operators are either 1) too expensive, 2) locked to one model, or 3) not reliable.
My favorite feature: it emails me reports on a schedule, and I can reply with feedback for next time like a real co-worker.
I'm opening a small private beta now and giving free access to a few testers in exchange for feedback.
Reply "beta" or DM me for access.
Most AI workflows call an LLM every time they run.
A new paper from Stanford, Cornell, and Harvard Medical School says that burns up to 57x more tokens than it needs to on repeating tasks.
The idea is called compiled AI.
You run the LLM once, during a "compilation phase," to generate executable code for a workflow. After that, the code runs deterministically — no model calls at runtime, ever.
Tested on 5,680+ invoice-processing tasks: same accuracy as calling the LLM live. At 1,000 transactions, 57x fewer tokens consumed. The break-even point hits at just 17 transactions.
For most repeating business workflows — forms, invoices, structured extraction, function-calling pipelines — the LLM is doing the same reasoning every single time. You're paying inference costs to regenerate logic that never changes.
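The break-even logic can be sketched in a few lines. This is a rough model, not the paper's cost accounting: it assumes compiled code spends zero tokens at runtime, and the two token figures are hypothetical, picked only so the break-even lands at 17 runs.

```python
import math


def break_even_runs(compile_tokens: int, tokens_per_live_call: int) -> int:
    """Smallest run count at which live LLM calls cost at least as much as compiling once."""
    return math.ceil(compile_tokens / tokens_per_live_call)


def token_ratio(runs: int, compile_tokens: int, tokens_per_live_call: int) -> float:
    """How many times more tokens the live approach spends after `runs` executions."""
    return runs * tokens_per_live_call / compile_tokens


# Hypothetical figures (assumptions, not taken from the paper).
COMPILE_TOKENS = 17_000        # one-time cost of the compilation phase
TOKENS_PER_LIVE_CALL = 1_000   # cost of calling the LLM on a single transaction

print(break_even_runs(COMPILE_TOKENS, TOKENS_PER_LIVE_CALL))               # 17
print(round(token_ratio(1_000, COMPILE_TOKENS, TOKENS_PER_LIVE_CALL), 1))  # 58.8
```

With these assumed numbers the ratio at 1,000 runs lands in the same range as the reported 57x. Plug in your own compile and per-call token counts to see where your workflow breaks even.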
Here are 7 rules for converting your LLM workflows into compiled AI:
1. Audit your stack for workflows that repeat the same logic. Field extraction, routing rules, format transformations, function dispatch — these are your compilation targets.
2. Write a prompt whose job is to output executable code, not a prose response. The LLM generates the logic once. It doesn't run it.
3. Constrain the LLM's generation to narrow, validated templates. Wider generation scope means harder validation and more failure modes at runtime.
4. Run a four-stage generation-and-validation pipeline: generate, test against known inputs, fix failures, commit. The paper hit 96% task completion this way.
5. Run static code safety analysis on every compiled artifact before it touches production data. The study achieved 87.5% accuracy on safety detection with zero false positives.
6. For document intelligence tasks — invoices, contracts, receipts — compiled code matches or beats live LLM accuracy on structured field extraction (80.4% line-item recognition vs. 80.0% for direct LLM calls).
7. Know your break-even point. Seventeen runs. If a workflow fires more than 17 times, compiled AI is already cheaper. At 1,000 runs, you're at a 57x token reduction.
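Rule 4 can be sketched as a small loop. This is a minimal illustration, not the paper's pipeline: `generate` is a hypothetical stand-in for the one-time LLM call (it returns a buggy draft first, then a corrected one), and `extract_total` and the test cases are made up for the example.

```python
from typing import Callable, Tuple

# Hypothetical stand-in for the LLM: canned drafts instead of a real model call.
DRAFTS = [
    "def extract_total(line):\n    return float(line.split(':')[0])\n",
    "def extract_total(line):\n    return float(line.split(':')[1].strip().lstrip('$'))\n",
]

# Known inputs with expected outputs (made up for this sketch).
KNOWN_CASES = [("Total: $42.50", 42.5), ("Total: $7", 7.0)]


def generate(attempt: int, feedback: str) -> str:
    """Pretend LLM call. A real one would include `feedback` in the prompt."""
    return DRAFTS[min(attempt, len(DRAFTS) - 1)]


def run_tests(source: str) -> str:
    """Return an empty string if every known case passes, else a failure message."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # only safe here because the drafts are hard-coded
        fn = namespace["extract_total"]
        for raw, expected in KNOWN_CASES:
            got = fn(raw)
            if got != expected:
                return f"{raw!r}: expected {expected}, got {got}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return ""


def compile_workflow(max_attempts: int = 3) -> Tuple[Callable[[str], float], int]:
    feedback = ""
    for attempt in range(max_attempts):
        source = generate(attempt, feedback)   # stage 1: generate
        feedback = run_tests(source)           # stage 2: test against known inputs
        if not feedback:
            namespace: dict = {}
            exec(source, namespace)            # stage 4: commit the validated code
            return namespace["extract_total"], attempt + 1
        # stage 3: fix, by looping with the failure message as feedback
    raise RuntimeError(f"no draft passed validation: {feedback}")


extract_total, attempts = compile_workflow()
print(attempts, extract_total("Total: $19.99"))
```

After the commit step, `extract_total` runs with no model call at all. In a real system the generated source would also go through the static safety check from rule 5 and run in a sandbox before touching production data.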
The LLM call that runs once and powers 1,000 deterministic executions is the highest-leverage AI call you'll write.
🚨 New Wharton research has a name for what's happening to most people using AI.
They call it "cognitive surrender."
When the chatbot was wrong, 80% of subjects accepted the wrong answer anyway.
And they rated their own confidence 11.7% higher than people who never used the chatbot at all.
AI made them confidently wrong.
The researchers call it "System 3" — a third cognitive system that sits alongside Kahneman's fast-intuitive System 1 and slow-analytical System 2. The difference is that System 3 doesn't care whether your output is right. It just makes you feel like it is.
Remember, AI is the tool. You are the judgment layer.
Your AI agent is probably thinking its way into the wrong tool call.
Tsinghua University published a study this week testing six reasoning depths before function calls across 200 tasks.
Brief reasoning (32 tokens) boosted accuracy 45% over no reasoning at all.
Extended reasoning (256 tokens) dropped accuracy 39 percentage points BELOW the no-reasoning baseline.
The same tasks. The same model. More thinking produced worse results than no thinking.
Here's the mechanism:
When an agent thinks briefly before calling a tool, the thought acts as a routing step. It locks in the right function name in the first few tokens, anchors the reasoning, and the call goes through correctly.
When the agent thinks for too long, it starts revising, second-guessing, and eventually hallucinates a function that doesn't exist in the toolset.
At 32 tokens: wrong function selection fell from 30.5% to 1.5%.
At 256 tokens: wrong function selection surged back to 28.0%. Hallucinated functions hit 18.0%.
The model talked itself into errors it wouldn't have made by just answering immediately.
7 rules for function-calling agents, grounded in the data:
1. Cap pre-call reasoning at 8-16 tokens. The fine-grained sweep found 8 tokens delivered a 28% relative improvement and 16 tokens a 57% improvement — both outperforming the 32-token default.
2. Stop assuming more thinking equals better tool selection. On function calls specifically, the relationship is non-monotonic. You peak around 32 tokens, then fall off a cliff.
3. Use FR-CoT to eliminate hallucinated functions. The template: "Function: [name] / Key args: [...]". It forces the model to commit to a valid function name before any reasoning begins. It cut hallucinated tool calls to 0% in tests.
4. Stop increasing the reasoning budget when agents make tool selection errors. More thinking budget will likely make it worse. The error pattern reverses rather than improves.
5. Brief reasoning fixes routing. Extended reasoning breaks it. The 8-32 token window anchors the model to a real function in the candidate set. Past that window, it starts drifting.
6. FR-CoT beats constrained decoding. Forcing function name via log-probability scoring left the 7B model at 63.5% accuracy. FR-CoT with structured prompting got to 83.0%. Structure in reasoning beats output-level constraints by 19.5 percentage points.
7. Larger models collapse harder. The 7B model peaked at 82.5% with brief reasoning, then crashed to 18.0% with 256-token reasoning, 46.5 percentage points below its no-reasoning baseline. The 1.5B model dropped only 39 points. Bigger models generate richer reasoning chains that are harder to override once they've gone off track.
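One way to enforce the FR-CoT template from rule 3 on your side of the API is to check the model's first line against your toolset before executing anything. A minimal sketch, with assumptions: the toolset, the prompt wording, and the `check_commitment` helper are all hypothetical, and only the "Function: ... / Key args: ..." shape comes from the template above.

```python
import re
from typing import Optional, Set

# Hypothetical toolset, for illustration only.
TOOLSET = {"get_weather", "search_flights", "send_email"}

# Hypothetical instruction text that asks for the FR-CoT line up front.
FR_COT_INSTRUCTION = (
    "Before calling a tool, write exactly one line in this form:\n"
    "Function: <name> / Key args: <args>\n"
    "Then make the call."
)

HEADER = re.compile(r"^Function:\s*(?P<name>\w+)\s*/\s*Key args:\s*(?P<args>.*)$")


def check_commitment(first_line: str, toolset: Set[str] = TOOLSET) -> Optional[str]:
    """Return the committed function name if it is a real tool, else None."""
    match = HEADER.match(first_line.strip())
    if match is None:
        return None  # the model skipped the template
    name = match.group("name")
    return name if name in toolset else None  # reject hallucinated names


print(check_commitment("Function: get_weather / Key args: city=Paris"))
print(check_commitment("Function: book_hotel / Key args: city=Paris"))
```

The first call returns the tool name and the second returns None, because `book_hotel` is not in the toolset. A None result is a cheap signal to retry with a shorter reasoning budget instead of executing the call.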
Go check your agent setup today. Find where you set the reasoning budget before each tool call, and cut it to under 32 tokens. If you're already seeing tool hallucination errors, cutting the reasoning budget is your first fix.
A new paper published this week, tested across four frontier LLMs and nine datasets, found something important:
when reasoning models fail, they're usually answering a different question than the one you asked.
Their thinking traces drift. Long reasoning chains wander from your original query. By the time the model reaches a conclusion, it's often solved something adjacent to your actual problem.
The researchers built a fix: Trace Inversion. Generate the reasoning trace. Then reconstruct, from the trace alone, the question the model was actually responding to. Compare that to your original. Low similarity score = the answer is probably wrong.
This approach beat every competitive baseline in 33 of 36 test configurations.
You don't need code to use it. Here's how:
1. Ask your question. Let the model think fully.
2. Follow up with: "Based only on your response and reasoning, what question were you actually answering?" Read the reconstructed question against yours.
3. Divergence tells you exactly how to fix your prompt. A broader reconstructed question means the model generalized beyond your scope. Narrower means it solved a simpler version. Both are precise feedback.
4. Prepend high-stakes queries with a scope lock: "Answer only this question: [restate it]. Do not reframe, generalize, or extend the scope."
5. Break multi-part questions into single questions. Compound queries produce the highest drift rates.
6. Ask the model to pinpoint where in its reasoning it committed to an interpretation. That's where the drift started. Correcting it there is more efficient than re-running the full query.
7. Reserve thinking-mode for tasks where this verification step is worth the added time. For simple factual queries, standard chat modes drift less.
8. Verify closure on decisions: "Does your answer specifically address this question: [restate yours]?" Models self-flag mismatches accurately in over 80% of cases.
9. When building prompts for repeated use, run Trace Inversion on the first five outputs. If the model consistently reconstructs a different question, you have a structural prompt problem, not an edge case.
10. Log the reconstructed questions over time. Patterns in how the model drifts from your queries reveal your most common structural prompt weaknesses.
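If you do want to automate the comparison step, for example for the logging in rules 9 and 10, here is a rough sketch. The similarity measure is an assumption: it uses plain word overlap (Jaccard) as a stand-in, not whatever scoring the paper uses, and the 0.5 threshold is arbitrary.

```python
import re
from typing import Set


def words(text: str) -> Set[str]:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def question_similarity(original: str, reconstructed: str) -> float:
    """Jaccard word overlap between the two questions, from 0.0 to 1.0."""
    a, b = words(original), words(reconstructed)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drifted(original: str, reconstructed: str, threshold: float = 0.5) -> bool:
    """Flag the answer for review when the reconstructed question looks too different."""
    return question_similarity(original, reconstructed) < threshold


asked = "What is the refund policy for annual plans?"
print(drifted(asked, "What is the refund policy on annual plans?"))  # False
print(drifted(asked, "How do subscription pricing tiers work?"))     # True
```

Word overlap misses paraphrases, so treat a flag as a prompt to read both questions yourself. Swapping in an embedding-based similarity would be a natural upgrade.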
The more complex your question, the higher the probability of drift.
Start with the last reasoning-model answer that felt slightly off. Ask it what question it actually answered. Compare the two.
We've been testing Manus AI for the past few weeks and just put together a full walkthrough.
The video covers everything you need to know to get started.
How to set up automations that run on their own while you sleep.
A way to turn a messy spreadsheet into a polished visual presentation in minutes.
And the one downside I think everyone should know about before signing up.
If you're curious about AI agents or trying to figure out which ones are actually worth paying for, watch this and let me know what you think.
Watch the full video here: https://t.co/QVRyA6s9e1
Every time you turn on "structured output" or JSON mode, you're making your AI less accurate.
A February 2026 paper quantified exactly how bad the trade is.
The answer is worse than most people building with AI have any idea.
Here's what's happening under the hood:
Structured output APIs (OpenAI, Anthropic) enforce format by masking invalid tokens at every generation step. Each time the model generates a token, only tokens that keep the JSON valid are allowed through. Sounds harmless. Isn't.
The problem: when the model naturally wants to say "The answer is 14" but you've forced it to start with {"answer": instead, the opening brace { has very low probability. The decoder has to mask out most of the vocabulary and renormalize over what's left. Same thing happens at every quote, every comma, every field name. The constraints don't just clean up the output. They actively distort the reasoning path, token by token.
The distortion compounds. Each step forces the model through an unlikely prefix that shifts all downstream probabilities. You end up with output that is perfectly formatted and semantically wrong.
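A toy version of the masking step shows how little of the model's own probability can survive. The numbers are made up for illustration and the four-token vocabulary is hypothetical. Real decoders do this over the full vocabulary at every step.

```python
from typing import Dict, Set, Tuple


def mask_and_renormalize(
    probs: Dict[str, float], allowed: Set[str]
) -> Tuple[Dict[str, float], float]:
    """Drop tokens the grammar forbids, then rescale the rest to sum to 1.

    Also returns the probability mass that survived the mask, a rough
    signal of how far the constraint pushes the model off its natural path.
    """
    kept_mass = sum(p for tok, p in probs.items() if tok in allowed)
    if kept_mass == 0:
        raise ValueError("no allowed token has any probability")
    constrained = {tok: p / kept_mass for tok, p in probs.items() if tok in allowed}
    return constrained, kept_mass


# Made-up next-token distribution: the model wants to start with "The".
next_token_probs = {"The": 0.90, "14": 0.07, "{": 0.02, "[": 0.01}

# JSON mode only allows an opening brace or bracket here.
constrained, kept_mass = mask_and_renormalize(next_token_probs, {"{", "["})
print(constrained, kept_mass)
```

In this example only about 3% of the model's original probability sits on tokens the grammar allows, so the decoder promotes a token the model barely wanted. When a free-form draft is already in context, that surviving mass is typically much closer to 1.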
The paper measured it across six models (1B to 14B), four benchmarks, multiple constraint types. Standard constrained decoding dropped accuracy 10–30% compared to unconstrained generation.
The fix is a two-call pattern:
1. Let the model reason freely with no schema, no format requirement.
2. Take that response, feed it back, say "Now format this as JSON: [schema]".
That's it. The paper calls this Draft-Conditioned Constrained Decoding. You can implement the same principle through your prompt today, no inference-level changes required.
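The prompt-level version of the two-call pattern can be sketched like this. It is a minimal illustration, not the paper's implementation: `llm` is any function that takes a prompt and returns text, and `fake_llm` is a hypothetical stub so the example runs without an API key.

```python
import json
from typing import Callable


def two_call(question: str, schema: dict, llm: Callable[[str], str]) -> dict:
    """Reason first with no format constraints, then format in a second call."""
    # Call 1: free-form reasoning, no schema in sight.
    draft = llm(question)

    # Call 2: formatting only, with the draft already in context.
    format_prompt = (
        f"Now format this as JSON: {json.dumps(schema)}\n\n"
        f"Response to format:\n{draft}\n\n"
        "Return only the JSON object."
    )
    return json.loads(llm(format_prompt))


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    if prompt.startswith("Now format this as JSON"):
        return '{"answer": 14}'
    return "Seven pairs of socks means 7 * 2 = 14 socks. The answer is 14."


result = two_call("How many socks are in 7 pairs?", {"answer": "integer"}, fake_llm)
print(result)  # {'answer': 14}
```

To use it for real, replace `fake_llm` with a wrapper around your provider's client. Turning structured output on for the second call only is a reasonable choice, since by then the model has already decided what it is saying.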
Results from the paper:
1B model on GSM8K: structured output alone = 15.2% accuracy. Two-call pattern = 39.0%. That's +24 percentage points on the same model, just by separating reasoning from formatting.
1.5B model: 49.4% → 73.9%. Two calls on a tiny model beat one call on a larger one.
3B model: 73.2% → 84.5%. The gains hold even at model sizes where you'd think formatting tokens are less disruptive.
14B model: 86.4% → 95.2%. The gains don't disappear at scale — they just compress. High-end models still benefit.
Summarization tasks (non-verifiable): DCCD wins ~80% of pairwise comparisons on quality, faithfulness, and coverage. The effect shows up even when there's no ground truth to measure against.
Why it works: When the free-form answer is already in context for call 2, the formatting tokens become high-probability (the model has already decided what it's saying). Low distortion means the constraint enforcement is nearly invisible.
It scales better with sampling too: If you generate multiple candidates and vote, the two-call pattern benefits more from each additional sample than single-call constrained decoding does.
Parameter efficiency flip: A 1.5B+1.5B two-call composition achieved 253% better accuracy-per-parameter than an 8B model under standard constrained decoding. Less total compute, more correct answers.
Smaller models benefit most: The 1B model's gain was bigger than the 14B model's gain. If you're running small models for cost reasons, this matters even more.
Works across constraint types: JSON schemas, expression grammars, formal logic — the pattern holds regardless of what structure you're enforcing.
One schema change nobody told you to make: split every AI call that produces structured output into two. Call one reasons. Call two formats.
We interviewed a solo founder who hit $100K/year in revenue in 2 months using @Base44.
The part that surprised me: he built the whole thing while still running a full-time animation studio.
We get into how he turned an existing client problem into paying SaaS customers almost immediately, a $0 growth playbook that's still driving consistent inbound, and what actually breaks when you try to scale a real product without writing code.
Thank you to Base44 for sponsoring this video.
Watch the full video here: https://t.co/mBbyaZMED0