Bro it’s June 2026. Stop hand-editing your prompts. Hold down the dictation button and ramble for 10 minutes. Give the model every fragment, caveat, example, and vibe in your head. It is literally a large language model. If it’s superhuman at anything, it’s reconstructing latent intent from language.
GPT-5.5-Cyber is our most capable cyber model yet, designed for advanced, authorized defensive work: tracing vulnerable code, validating issues, developing patches, and preparing evidence for human review.
We keep hearing that GLM-5.2 beats Opus 4.8, but we're skeptical of benchmarks, so we tested both on a real bug from the Cline repo. Both models fixed the issue, but GLM won on cost and code quality:
- GLM used about 1.7x as many tokens (GLM 1.1M vs Opus 660K) but cost about half as much (GLM $0.41 vs Opus $0.81)
- Opus finished quicker - 1.6 min and 12 tool calls vs GLM 4.7 min and 28 tool calls
- GLM cleaned up dead code and verified the build compiled before completing. Opus didn't - it left type errors that passed tests but broke the production build.
Both runs used the same Cline harness prompting and tools, so it seems GLM is RL-trained to spend more tokens verifying its work before completing. Impressive work by the @Zai_org team!
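
For anyone who wants to sanity-check the ratios behind this comparison, here's a rough sketch in Python. The numbers are the ones quoted in the post. The `Run` class and the "blended $ per million tokens" figure are our own illustrative assumptions: the blended rate just divides total cost by total tokens and ignores any input/output/cache pricing split, so treat it as a ballpark and not as either provider's actual pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One agent run, using the figures quoted in the post (hypothetical container)."""
    tokens: int
    cost_usd: float
    minutes: float
    tool_calls: int

    def usd_per_million_tokens(self) -> float:
        # Blended rate: total cost / total tokens. This is an assumption-laden
        # simplification that ignores input vs output vs cached token pricing.
        return self.cost_usd / self.tokens * 1_000_000


glm = Run(tokens=1_100_000, cost_usd=0.41, minutes=4.7, tool_calls=28)
opus = Run(tokens=660_000, cost_usd=0.81, minutes=1.6, tool_calls=12)


def ratios(a: Run, b: Run) -> dict[str, float]:
    """Return a/b for each reported metric."""
    return {
        "tokens": a.tokens / b.tokens,
        "cost": a.cost_usd / b.cost_usd,
        "minutes": a.minutes / b.minutes,
        "tool_calls": a.tool_calls / b.tool_calls,
    }


glm_vs_opus = ratios(glm, opus)

for metric, value in glm_vs_opus.items():
    print(f"GLM / Opus {metric}: {value:.2f}x")

print(f"GLM blended:  ${glm.usd_per_million_tokens():.2f} per 1M tokens")
print(f"Opus blended: ${opus.usd_per_million_tokens():.2f} per 1M tokens")
```

On these figures the token ratio comes out to roughly 1.67x and the cost ratio to roughly 0.51x. Keep in mind this is a single bug and a single run per model, so the ratios describe this one test and not general behavior.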