After testing GPT-5 (Pro subscription) since launch on real work (coding, research, prod reviews), here’s a straight, no-fluff take.
TL;DR
GPT-5 Thinking is the best “facts + web + synthesis” model I’ve used so far. GPT-5 Pro feels like a staff/principal engineer doing risk reviews. Codex CLI is now the default implementer; Claude Code is the reviewer. Hallucinations are near-zero when used correctly. Not a silver bullet; still needs tests and discipline.
Precision & Web
•Retrieves, verifies, and compresses web info with very low hallucination rate.
•Better than “Deep Research”-style flows tried before: faster to the point, more signal per token, fewer detours.
•o3 was already elite at reasoning; GPT-5 Thinking is o3+, with deeper nuance and tighter sourcing.
Model Routing
•The picker/router was confusing early on, so GPT-5 Thinking is the default for anything non-trivial.
•The non-thinking variant is only used for universally known facts (“Who was Marcus Aurelius?”).
•Non-thinking struggled on math/physics stress prompts (5 fails on a standard test prompt used across models). It’s not the right tool for formal derivations.
GPT-5 Pro = Production Guardian
•Workflow: build an implementation plan with Claude Code + GPT-5 via Codex CLI (after giving repo context) → hand the plan to GPT-5 Pro.
•What happens: GPT-5 Pro spots production-grade failures before they happen: race conditions, idempotency gaps, edge-case input handling, flaky retries, concurrency pitfalls, security regressions.
•The difference: not generic “lint”; it flags the exact line of failure and the real-world blast radius (e.g., webhook replay + partial DB commit = phantom charges). That’s principal-engineer-level scrutiny.
•Hallucinations were effectively zero in these reviews; citations and reasoning held up under adversarial checks.
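To make the "webhook replay + partial DB commit = phantom charges" example concrete, here is a minimal sketch of one common guard. It assumes each webhook event carries a unique ID. The table names, the event shape, and `handle_webhook` are hypothetical illustrations, not output from any actual GPT-5 Pro review.

```python
import sqlite3


def make_db():
    """Create an in-memory database with hypothetical tables for the sketch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE charges (event_id TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def handle_webhook(conn, event):
    """Apply a charge event at most once.

    Returns True if the charge was recorded, False if the event was a replay.
    The dedupe marker and the charge are written in one transaction, so a
    failure between the two writes cannot leave a half-applied event behind.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event["id"],),
        )
        if cur.rowcount == 0:
            return False  # event ID already seen: treat as a replay
        conn.execute(
            "INSERT INTO charges (event_id, amount_cents) VALUES (?, ?)",
            (event["id"], event["amount_cents"]),
        )
    return True
```

Delivering the same event twice records one charge, and an event that fails partway leaves no marker behind, so a later retry is still processed.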
Codex CLI vs Claude Code
•Codex CLI is slower than Claude Code at times and can feel conservative, but it’s more solid and avoids nonsensical diffs.
•Best pattern found: Codex CLI as the implementer, Claude Code as the second-opinion reviewer focused on clarity and refactors. Net effect: fewer regressions, cleaner merges.
Props to @embirico for keeping in touch with the community and to @OpenAI for offering subscription-based usage instead of API-only access. This product has improved significantly in a very short time.
Where GPT-5 Thinking Shines
•Web-backed briefs, competitive scans, RFC-style design notes, failure-mode analysis, and “compress the internet into what matters” tasks.
•It consistently catches the subtle stuff o3 sometimes missed and keeps the write-ups crisp.
Limitations & Caveats
•Non-thinking ≠ math engine; use Thinking/Pro for formal reasoning or back it with a CAS/test harness.
•Speed can vary; don’t block delivery on a single long run. Instead, stage work and keep tests green.
•Never outsource judgment: enforce idempotency, add invariants, run chaos/replay tests, and treat outputs as proposals until the CI proves them.
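One cheap way to "back it with a test harness" is to check any closed form a model proposes against a brute-force reference before trusting it. The sketch below is only an illustration: the function names and the sum-of-squares example are made up for this purpose, and a numeric spot check is evidence, not a proof.

```python
def find_counterexamples(claimed, reference, inputs):
    """Return every input where the claimed formula disagrees with the reference."""
    return [x for x in inputs if claimed(x) != reference(x)]


def brute_force_sum_of_squares(n):
    """Reference implementation: add up 1^2 + 2^2 + ... + n^2 directly."""
    return sum(k * k for k in range(1, n + 1))


def claimed_sum_of_squares(n):
    """A closed form a model might propose: n(n+1)(2n+1)/6."""
    return n * (n + 1) * (2 * n + 1) // 6


def wrong_claim(n):
    """A plausible-looking but incorrect variant (sign error in the last factor)."""
    return n * (n + 1) * (2 * n - 1) // 6


if __name__ == "__main__":
    inputs = range(0, 200)
    print(find_counterexamples(claimed_sum_of_squares, brute_force_sum_of_squares, inputs))
    print(find_counterexamples(wrong_claim, brute_force_sum_of_squares, inputs)[:3])
```

The same pattern fits CI: the model's answer is a proposal, and the brute-force check is what decides whether it gets merged.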
Verdict
This is the first time an LLM actually felt like a staff/principal engineer on call 24/7. For this use case, shipping reliable software with real stakes, GPT-5 is an upgrade over o3 in depth, subtlety, and truthfulness. Expectations exceeded.
@thsottiaux @embirico Please add GPT-5 Pro to Codex CLI. Codex CLI (all models/reasoning levels) and Claude Code (all models, ultrathink) couldn’t solve a stubborn Channex room_type 500. GPT-5 Pro solved it in one prompt: drop id/property_id from the update body; update photos by ID or is_removed; make positions zero-based and unique. No other model solved it (I tried every model on Cursor too). They all suggested contacting https://t.co/dFOv9KBfDT support because they could not find why it was throwing a 500, and the cause was not documented in the official API documentation either. GPT-5 Pro is on another level and would be a huge time saver in the CLI.
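A rough sketch of what that fix could look like as a payload-cleaning step. Only `id`, `property_id`, `is_removed`, and the zero-based unique positions come from the fix described above; the flat `photos` list, the `position` key, dropping `position` from removed photos, and the function name are assumptions for illustration, not the documented Channex API.

```python
import copy


def clean_room_type_update(body):
    """Return a copy of a room_type update body with the three fixes applied.

    The payload layout here is hypothetical; adjust the keys to the real schema.
    """
    cleaned = copy.deepcopy(body)  # never mutate the caller's dict

    # 1. Drop identifiers that should not appear in the update body.
    cleaned.pop("id", None)
    cleaned.pop("property_id", None)

    if "photos" in cleaned:
        photos = cleaned["photos"]
        kept = [p for p in photos if not p.get("is_removed")]
        removed = [p for p in photos if p.get("is_removed")]

        # 2. Renumber kept photos so positions are zero-based and unique.
        #    sort() is stable, so photos with equal positions keep their order.
        kept.sort(key=lambda p: p.get("position", 0))
        for index, photo in enumerate(kept):
            photo["position"] = index

        # 3. Removed photos are referenced by ID plus is_removed only
        #    (assumption: they should not compete for a position).
        for photo in removed:
            photo.pop("position", None)

        cleaned["photos"] = kept + removed

    return cleaned
```

Running the body through a step like this before the update request keeps the three rules in one place instead of scattered across call sites.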
@thdxr no, it’s more than that. the “mistake” is still there: cc performance is still terrible during peak hours and they do not admit it. according to them everything is performing fine now, but if you try to work with cc 4 hours from now, it can’t do a hello world page properly.
OpenAI released GPT-5 Codex and it was slow for one day, so they reset the limits for everybody. They acknowledged the problem and solved it in one day. Anthropic is still serving degraded models to users; just log in at 8-10 PM ET and see if it does anything more than a Hello World. Nothing changed!
@AnthropicAI Nowhere to report it on your website, but Claude models (both Opus 4.1 and Sonnet 4) on Claude Code have been really bad for the past hour. Same hours yesterday and today. At least don’t say on the status page that the models are working fine. GPT-3 level atm.
ok great, claude code was working very well today and suddenly i feel like it switched to gpt 3. it’s unusable right now and nothing on the anthropic page about it.
@unclebobmartin try asking it questions about different animals, then ask how humans are different. it will always say “we” for humans, referring to itself as human.
@aidan_mclau I still use both Claude Code and Codex CLI. Claude Code is only for computer stuff now, like helping with VPS config, where it’s great, but I use Codex CLI for coding because Claude Code has degraded in quality completely. GPT-5 High is by far the best for coding.
@CGoodman308 This is the reason you should stop using 4o. It hasn’t changed, but it will say whatever you want it to say and make you believe what you want to believe.
@sama 1- 5 Pro in Codex CLI (even if it’s only a few prompts per 5 hrs). I use it a lot for planning and it’s a genius model, but having to give it context on the web is a pain.
2- /reason to pick the reasoning effort.
3- Windows support
4- Web tools and ability to make web calls
@Grummz Could you tell us more about how you are building with Grok? Are you using Cursor or some other tool? My codebase is over 100k LOC and a LOT of files, so I’m not sure how I could build with Grok.