You've probably seen a lot of benchmarks suggesting that GLM-5 and MiniMax-M2.5 are basically on par with Opus/GPT/Gemini. But from running these tests, a few problematic patterns stand out:
- Coherence: GLM/MiniMax often fail to keep a fully coherent view of the whole generation, so elements are often mismatched
- Stability: you can get good stuff out, but in my tests about half the time you get something really weird, which isn't the case with e.g. Opus
- World knowledge: generations are often about 30% off the real thing, like an off-brand version
That doesn't mean the models are bad, but we shouldn't get carried away: they do not yet match Opus/GPT/Gemini.