Benchmarking Frontier LLMs on Chess
Over the weekend I built a series of evals to understand how language models reason about endgames, tactics, and full chess games against strong opponents. Turns out they are getting pretty good!
https://t.co/zRRrD3NfMO
Models will keep improving but the risk of one bad prompt wrecking your app will (very likely) remain nonzero.
Vintage devs just use git. Go-to-bed-grandpa AI PMs will use history rollbacks. And rollbacks w/ mobile live previews = 🪬
Open sourcing the first @expo vibe coding web IDE and SDK: React Native Vibe Code
Powered by @claudeai agent SDK, history rollbacks, live web and native app previews, full stack setup by @convex, publish to web w/ @Cloudflare, voice prompting, upload assets to app, add images and files to prompt, model selector, skills loader, visual edits, sandboxing by @e2b, download codebase option, Monaco code editor, fork/remix and a CLI to run locally.
The project is a @turborepo running @nextjs hosted on @vercel with streaming powered by @aisdk
◆ try cloud version at https://t.co/v8q8TjUake
◆ github repo: https://t.co/BtrpEKzdA1
@DimitriosMitsos @bqbrady @meridian Re: the article above, biggest gap by far has been verifiability.
Context is a bottleneck. But even without pre-indexing, in human readable codebases, exploration feels O(1) relative to the verifiability challenge.