@htihle You should run a subset of problems on the official platform to make sure there is no provider issue. Otherwise the result is not very convincing.
DeepSeek V4 Pro just matched GPT-5.2 on FoodTruck Bench, our agentic benchmark: 10 weeks later, and ~8× cheaper.
First Chinese model in our frontier tier. The China–US gap that used to feel like a year is now ~10 weeks.
https://t.co/qIOB1PwWdR