i don’t trust any paper whose main results are math benchmarks on qwen models. these models are so overfitted to the test sets that you could do any training and it would improve performance
anthropic running the exact same marketing playbook with every release. “our model is so capable and dangerous, ahh we are afraid to release it”. just put the model in the bag lil bro.