🚀 BIG ANNOUNCEMENT
I’m thrilled and proud to share that we've released all the resources and work developed over the past two years at the Tbilisi AI Lab.
What’s inside:
- 4 powerful models in 3.8B & 12B sizes — pretrained, instruct, and fully aligned.
- 20+ datasets: pretraining data, 3M instruction examples, 0.5M DPO pairs, plus function-calling and specific instruct datasets.
- Benchmarks: all classical benchmarks translated into Georgian.
6️⃣ Things to Know about
AI Engineer World's Fair 2026
- It’s bigger than all previous AIEs
- 4x Larger Expo with 4 Expo stages
- Researchers: Poster sessions & Poaster sessions
- AI Leadership: Token Billionaires & Off the Record
- AI Verticals: Healthcare, GTM, FDE, AGC, Finance
- Side Events: NEO, Kids day
- Attendees get $40k in credits to try everything our sponsors have to offer!
It's going to be our BIGGEST show yet!
I’m excited to share that I’ll be joining OpenAI and look forward to working with the exceptional team there.
It was a difficult decision to move on. I’m incredibly proud of the amazing team at Google and everything we’ve built together. It has been an honor and a pleasure to work with all of you.
At OpenAI, we're continuing to bet on Rust as the future of systems programming.
I'm proud to announce that we're making a $600,000 commitment to the Rust Foundation, which combines our Platinum membership with additional support for maintainer efforts across the Rust ecosystem.
Silent truncation is the invisible ceiling on agent reliability.
File reads cap. Tool results cap. Memory indexes cap. The model never knows what it did not see; it just answers on a partial view.
Make the cut visible before you blame the model.
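One way to make the cut visible, as a minimal sketch: cap the tool output, but put a notice in the text the model reads. The names `ToolResult` and `cap_with_notice` and the notice wording are made up for illustration, not taken from any particular agent framework.

```python
from dataclasses import dataclass


@dataclass
class ToolResult:
    text: str         # what the model actually sees
    truncated: bool   # True if anything was cut
    shown_chars: int  # characters of the original that survived
    total_chars: int  # characters in the original output


def cap_with_notice(raw: str, limit: int) -> ToolResult:
    """Cap `raw` at `limit` characters and say so inside the output itself."""
    if len(raw) <= limit:
        return ToolResult(raw, False, len(raw), len(raw))
    # Hypothetical notice format: the point is that the model can read it.
    notice = f"\n[TRUNCATED: showing {limit} of {len(raw)} chars]"
    return ToolResult(raw[:limit] + notice, True, limit, len(raw))
```

The counts are kept on the result as well, so the harness can log how often the model answered on a partial view.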
According to Grok, Andrej Karpathy is an EB-1 extraordinary ability green card recipient, not a US citizen. Thus under these new restrictions he is not permitted to use, or work on, Mythos 5 or Fable 5 as of 5:21pm tonight.
every team has a feedback loop.
It might be an eval set, manual QA, A/B tests, or just opening the app and vibe-checking
The question is how late it catches regressions:
before launch,
after launch,
or after users complain
Treat the agent as a user too. Don't just ask "what does the human need?" Also ask:
what does the agent need to see,
call,
verify,
and recover from?
Bad tool surfaces make smart models look dumb.
A memory bug pattern I keep seeing is asking the model to rewrite too much.
If only one preference changes, just update that one slot; don’t regenerate the entire profile.
Small writes are boring.
Boring is good for memory.
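A small write can look like this. It's a sketch that assumes the profile is a flat dict of slots; `update_slot` and `changed_slots` are illustrative names, not a real memory API.

```python
from typing import Any

_MISSING = object()  # sentinel so a missing slot differs from a slot set to None


def update_slot(profile: dict[str, Any], slot: str, value: Any) -> dict[str, Any]:
    """Return a copy of `profile` with exactly one slot changed."""
    updated = dict(profile)  # shallow copy; every other slot is carried over as-is
    updated[slot] = value
    return updated


def changed_slots(before: dict[str, Any], after: dict[str, Any]) -> set[str]:
    """List the slots that differ, so a write that touched too much is easy to spot."""
    keys = before.keys() | after.keys()
    return {k for k in keys if before.get(k, _MISSING) != after.get(k, _MISSING)}
```

If the model asked to change one preference and `changed_slots` comes back with five names, the write was a rewrite.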
Recently, we purchased one of each Anthropic/OpenAI subscription plan and randomly ran long-horizon coding tasks until we exhausted the weekly limit. It's widely believed that a $200/month plan maxes out at ~$2000/month worth of tokens (assuming API pricing). However, we found that the subscriptions are actually far more generous. (2/4)
Flag when the agent is losing the thread, not just when token usage is high. It shouldn’t be that hard to surface real trajectory warnings: repeated failed searches, contradictory context, old decisions resurfacing, and compaction risk.
Rewind is useful.
Knowing when to rewind is the hard part.
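Here is a rough sketch of one such warning: repeated failed searches. It assumes the harness records each tool call as a step with a success flag; the `Step` shape and the threshold of 3 are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str   # hypothetical tool name, e.g. "search"
    query: str  # the argument the agent passed
    ok: bool    # did the call return anything useful?


def repeated_failures(steps: list[Step], threshold: int = 3) -> list[str]:
    """Warn for each (tool, query) pair that has failed `threshold` or more times."""
    # Normalize queries so "Foo" and "foo " count as the same attempt.
    fails = Counter((s.tool, s.query.strip().lower()) for s in steps if not s.ok)
    return [
        f"{tool} failed {n}x for {query!r}: consider rewinding"
        for (tool, query), n in fails.items()
        if n >= threshold
    ]
```

This only covers one signal. Contradictory context and resurfacing decisions would need their own checks.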
Games give instant feedback: illegal move -> blocked. Code has tests. Travel has no game engine.
So we built the verifier ourselves: domain rules, guardrails, business logic.
The domains where agents add the most value are the ones where the harness is hardest to build.
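To give a feel for what a hand-built verifier means, here is a toy sketch, not the actual system described above. The `Leg` shape, the 45-minute minimum connection, and the assumption that all times share one timezone are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Leg:
    origin: str
    destination: str
    departs: datetime  # assumed to be in one shared timezone
    arrives: datetime


def verify_itinerary(
    legs: list[Leg], min_connection: timedelta = timedelta(minutes=45)
) -> list[str]:
    """Return rule violations; an empty list means the plan passes these checks."""
    errors = []
    for i, leg in enumerate(legs):
        if leg.arrives <= leg.departs:
            errors.append(f"leg {i}: arrival is not after departure")
    for i, (prev, nxt) in enumerate(zip(legs, legs[1:])):
        if prev.destination != nxt.origin:
            errors.append(f"leg {i}->{i + 1}: {prev.destination} does not connect to {nxt.origin}")
        elif nxt.departs - prev.arrives < min_connection:
            errors.append(f"leg {i}->{i + 1}: connection shorter than {min_connection}")
    return errors
```

The agent's plan is rejected whenever the list is non-empty, which is the travel version of "illegal move -> blocked".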
Fable 5 is the biggest step up I’ve felt in our models since Opus 4.5 back in November. After 4.5 came out I uninstalled my IDE when I realized that I’d been doing 100% of my coding in a terminal for a few weeks. With Fable, it’s felt like Claude has stepped up from being a coding agent to a thought and design partner in building the product. Fable has judgement, taste, and dimensionality in a way that previous models didn’t, leading me to trust it more with the most complex work.
I think the first time I had this realization was when I asked Fable to debug something. It is the first model I have used that was so methodical and precise, taking measurements and adding logs then verifying that it truly fixed the issue before declaring victory.
There’s nothing in Claude Code’s prompting telling the model to do that; it’s just part of its personality. It really has this “big model smell” that I haven’t felt before.
Don't ask an agent "is this done?" That lets it close the loop too early.
Ask it to find the bug, the smaller diff, the simpler path. Now it has a search problem, not a yes/no permission slip.
I do this with Codex all the time. Ask it to review code for bugs and it will tell you all is good; tell it there is a bug and it will LOOP AND LOOP and will find issues.